eeff92ad32
rustdoc-search: use set ops for ranking and filtering

This commit adds ranking and quick filtering to type-based search, improving performance and having it order results based on their type signatures.

Preview
-------

Profiler output: https://notriddle.com/rustdoc-html-demo-6/profile-8/index.html

Preview: https://notriddle.com/rustdoc-html-demo-6/ranking-and-filtering-v2/std/index.html

Motivation
----------

If I write a query like `str -> String`, a lot of functions come up. That's to be expected, but `String::from` should come up on top, and it doesn't right now. This is because the sorting algorithm is based on the function's name, and doesn't consider the type signature at all. `slice::join` even comes up above it!

To fix this, the sorting should take into account the function's signature, and the closer match should come up on top.

Guide-level description
-----------------------

When searching by type signature, types with a "closer" match will show up above types that match less precisely.

Reference-level explanation
---------------------------

Function signature search works in two major phases:

* A compact "fingerprint," based on the [bloom filter] technique, is used to check for matches and to estimate the distance. It sometimes has false positive matches, but it also operates on 128-bit contiguous memory and requires no backtracking, so it performs a lot better than real unification.

  The fingerprint represents the set of items in the type signature, but it does not represent nesting, and it ignores when the same item appears more than once.

  The result is rejected if any query bits are absent in the function, or if the distance is higher than the current maximum and 200 results have already been found.

* The second step performs unification. This is where nesting and true bag semantics are taken into account, and it has no false positives. It uses a recursive, backtracking algorithm.

  The result is rejected if any query elements are absent in the function.

[bloom filter]: https://en.wikipedia.org/wiki/Bloom_filter

Drawbacks
---------

This makes the code bigger.

More than that, this design is a subtle trade-off. It makes the cases I've tested against measurably faster, but it's not clear how well this extends to other crates with potentially more functions and fewer types.

The more complex things get, the more important it is to gather a good set of data to test with (this is arguably more important than the actual benchmarking infrastructure right now).

Rationale and alternatives
--------------------------

Throwing a bloom filter in front makes it faster.

More than that, it tries to take a tactic where the system can not only check for potential matches, but also gets an accurate distance function without needing to do unification. That way it can skip unification even on items that have the needed elems, as long as they have more items than the currently found maximum.

If I didn't want to be able to cheaply do set operations on the fingerprint, a [cuckoo filter] is supposed to have better performance. But the nice bit-banging set intersection doesn't work AFAIK.

I also looked into [minhashing], but since it's actually an unbiased estimate of the similarity coefficient, I'm not sure how it could be used to skip unification (I wouldn't know if the estimate was too low or too high).

This function actually uses the number of distinct items as its "distance function." This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|F\cap{}Q|}{|F\cup{}Q|}$, while being cheaper to compute. This is because:

* The function $F$ must be a superset of the query $Q$, so their union is just $F$ and the intersection is $Q$, and it can be reduced to $1-\frac{|Q|}{|F|}$.

* There are no magic thresholds. These values are only being used to compare against each other while sorting (and, if 200 results are found, to compare with the maximum match). This means we only care if one value is bigger than the other, not what its actual value is, and since $Q$ is the same for everything, it can be safely left out, reducing the formula to $1-\frac{1}{|F|} = \frac{|F|}{|F|}-\frac{1}{|F|} = \frac{|F|-1}{|F|}$. This value grows whenever $|F|$ grows, so, since the values are only being compared with each other, $|F|$ is fine.

Prior art
---------

This is significantly different from how Hoogle does it. It doesn't account for order, and it has no special account for nesting, though `Box<t>` is still two items, while `t` is only one.

This should give the same results that it would have gotten from a Jaccard Distance $1-\frac{|A\cap{}B|}{|A\cup{}B|}$, while being cheaper to compute.

Unresolved questions
--------------------

`[]` and `()`, the slice/array and tuple/unit operators, are ignored while building the signature for the query. This is because they match more than one thing, making them ambiguous. Unfortunately, this also makes them a performance cliff. Is this likely to be a problem?

Right now, the system just stashes the type distance into the same field that levenshtein distance normally goes in. This means exact query matches show up on top (for example, if you have a function like `fn nothing(a: Nothing, b: i32)`, then searching for `nothing` will show it on top even if there's another function with `fn bar(x: Nothing)` that's technically a closer match in type signature).

Future possibilities
--------------------

It should be possible to adopt more sorting criteria to act as a tie breaker, which could be determined during unification.

[cuckoo filter]: https://en.wikipedia.org/wiki/Cuckoo_filter
[minhashing]: https://en.wikipedia.org/wiki/MinHash
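The fingerprint phase and the distinct-item distance described in the commit message above can be illustrated with a small Rust sketch. This is not the code from the commit: the `Fingerprint` layout (a `u128` of filter bits plus a separate distinct-item count), the FNV-1a hash with two seeds, and the names `fingerprint`, `may_match`, and `rank` are all assumptions made for illustration. Unification and the 200-result cutoff are left out.

```rust
use std::collections::BTreeSet;

/// Hypothetical fingerprint: 128 Bloom-filter bits plus a distinct-item count.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Fingerprint {
    /// Bloom filter bits; each distinct item sets up to two of them.
    pub bits: u128,
    /// Number of distinct items, used directly as the ranking distance |F|.
    pub distinct: u32,
}

/// FNV-1a, picked only because it is short and deterministic (an assumption;
/// any reasonably well-mixed hash would serve the same purpose).
fn fnv1a(bytes: &[u8], seed: u64) -> u64 {
    let mut hash = 0xcbf2_9ce4_8422_2325u64 ^ seed;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Builds a fingerprint from the item names in a signature.
/// Nesting and repeated items are deliberately not represented.
pub fn fingerprint(items: &[&str]) -> Fingerprint {
    let distinct: BTreeSet<&str> = items.iter().copied().collect();
    let mut bits = 0u128;
    for name in &distinct {
        bits |= 1u128 << (fnv1a(name.as_bytes(), 0) % 128);
        bits |= 1u128 << (fnv1a(name.as_bytes(), 1) % 128);
    }
    Fingerprint { bits, distinct: distinct.len() as u32 }
}

/// Phase one: true when every query bit is also set in the function.
/// `true` may be a false positive; `false` is always a correct rejection.
pub fn may_match(function: Fingerprint, query: Fingerprint) -> bool {
    (function.bits & query.bits) == query.bits
}

/// Keeps candidates that pass the fingerprint check and orders them by
/// distinct-item count, so signatures with fewer extra items come first.
pub fn rank<'a>(candidates: &[(&'a str, Fingerprint)], query: Fingerprint) -> Vec<&'a str> {
    let mut hits: Vec<(&'a str, u32)> = candidates
        .iter()
        .filter(|(_, f)| may_match(*f, query))
        .map(|(name, f)| (*name, f.distinct))
        .collect();
    hits.sort_by_key(|&(_, distinct)| distinct);
    hits.into_iter().map(|(name, _)| name).collect()
}
```

Because the query is the same for every candidate and every surviving function is a superset of it, sorting by `distinct` gives the same order as sorting by the Jaccard distance. In a full implementation the survivors would still go through unification, since the filter can report false positives.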
The Rust Programming Language
This is the main source code repository for Rust. It contains the compiler, standard library, and documentation.
Note: this README is for users rather than contributors. If you wish to contribute to the compiler, you should read CONTRIBUTING.md instead.
Quick Start
Read "Installation" from The Book.
Installing from Source
The Rust build system uses a Python script called x.py to build the compiler, which manages the bootstrapping process. It lives at the root of the project. It also uses a file named config.toml to determine various configuration settings for the build. You can see a full list of options in config.example.toml.
The x.py command can be run directly on most Unix systems in the following format:
./x.py <subcommand> [flags]
This is how the documentation and examples assume you are running x.py. See the rustc dev guide if this does not work on your platform.
More information about x.py can be found by running it with the --help flag or reading the rustc dev guide.
Dependencies
Make sure you have installed the dependencies:
- python 3 or 2.7
- git
- A C compiler (when building for the host, cc is enough; cross-compiling may need additional compilers)
- curl (not needed on Windows)
- pkg-config if you are compiling on Linux and targeting Linux
- libiconv (already included with glibc on Debian-based distros)
To build Cargo, you'll also need OpenSSL (libssl-dev or openssl-devel on most Unix distros).
If building LLVM from source, you'll need additional tools:
- g++, clang++, or MSVC with versions listed on LLVM's documentation
- ninja, or GNU make 3.81 or later (Ninja is recommended, especially on Windows)
- cmake 3.13.4 or later
- libstdc++-static may be required on some Linux distributions such as Fedora and Ubuntu
On tier 1 or tier 2 with host tools platforms, you can also choose to download LLVM by setting llvm.download-ci-llvm = true. Otherwise, you'll need LLVM installed and llvm-config in your path.
See the rustc-dev-guide for more info.
Building on a Unix-like system
Build steps
- Clone the source with git:
  git clone https://github.com/rust-lang/rust.git
  cd rust
- Configure the build settings:
  ./configure
  If you plan to use x.py install to create an installation, it is recommended that you set the prefix value in the [install] section to a directory:
  ./configure --set install.prefix=<path>
- Build and install:
  ./x.py build && ./x.py install
  When complete, ./x.py install will place several programs into $PREFIX/bin: rustc, the Rust compiler, and rustdoc, the API-documentation tool. By default, it will also include Cargo, Rust's package manager. You can disable this behavior by passing --set build.extended=false to ./configure.
Configure and Make
This project provides a configure script and makefile (the latter of which just invokes x.py). ./configure is the recommended way to programmatically generate a config.toml. make is not recommended (we suggest using x.py directly), but it is supported and we try not to break it unnecessarily.
./configure
make && sudo make install
configure generates a config.toml which can also be used with normal x.py invocations.
Building on Windows
On Windows, we suggest using winget to install dependencies by running the following in a terminal:
winget install -e Python.Python.3
winget install -e Kitware.CMake
winget install -e Git.Git
Then edit your system's PATH variable and add: C:\Program Files\CMake\bin.
See this guide on editing the system PATH from the Java documentation.
There are two prominent ABIs in use on Windows: the native (MSVC) ABI used by Visual Studio and the GNU ABI used by the GCC toolchain. Which version of Rust you need depends largely on what C/C++ libraries you want to interoperate with. Use the MSVC build of Rust to interop with software produced by Visual Studio and the GNU build to interop with GNU software built using the MinGW/MSYS2 toolchain.
MinGW
MSYS2 can be used to easily build Rust on Windows:
- Download the latest MSYS2 installer and go through the installer.
- Run mingw32_shell.bat or mingw64_shell.bat from the MSYS2 installation directory (e.g. C:\msys64), depending on whether you want 32-bit or 64-bit Rust. (As of the latest version of MSYS2 you have to run msys2_shell.cmd -mingw32 or msys2_shell.cmd -mingw64 from the command line instead.)
- From this terminal, install the required tools:
  # Update package mirrors (may be needed if you have a fresh install of MSYS2)
  pacman -Sy pacman-mirrors

  # Install build tools needed for Rust. If you're building a 32-bit compiler,
  # then replace "x86_64" below with "i686". If you've already got Git, Python,
  # or CMake installed and in PATH you can remove them from this list.
  # Note that it is important that you do **not** use the 'python2', 'cmake',
  # and 'ninja' packages from the 'msys2' subsystem.
  # The build has historically been known to fail with these packages.
  pacman -S git \
              make \
              diffutils \
              tar \
              mingw-w64-x86_64-python \
              mingw-w64-x86_64-cmake \
              mingw-w64-x86_64-gcc \
              mingw-w64-x86_64-ninja
- Navigate to Rust's source code (or clone it), then build it:
  python x.py setup user && python x.py build && python x.py install
MSVC
MSVC builds of Rust additionally require an installation of Visual Studio 2017 (or later) so rustc can use its linker. The simplest way is to get Visual Studio and select the "C++ build tools" and "Windows 10 SDK" workloads.
(If you're installing CMake yourself, be careful that "C++ CMake tools for Windows" doesn't get included under "Individual components".)
With these dependencies installed, you can build the compiler in a cmd.exe shell with:
python x.py setup user
python x.py build
Right now, building Rust only works with some known versions of Visual Studio. If you have a more recent version installed and the build system doesn't recognize it, you may need to force rustbuild to use an older version. This can be done by manually calling the appropriate vcvars file before running the bootstrap.
CALL "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat"
python x.py build
Specifying an ABI
Each specific ABI can also be used from either environment (for example, using the GNU ABI in PowerShell) by using an explicit build triple. The available Windows build triples are:
- GNU ABI (using GCC)
  - i686-pc-windows-gnu
  - x86_64-pc-windows-gnu
- The MSVC ABI
  - i686-pc-windows-msvc
  - x86_64-pc-windows-msvc
The build triple can be specified by either specifying --build=<triple> when invoking x.py commands, or by creating a config.toml file (as described in Building on a Unix-like system), and passing --set build.build=<triple> to ./configure.
Building Documentation
If you'd like to build the documentation, it's almost the same:
./x.py doc
The generated documentation will appear under doc in the build directory for the ABI used. That is, if the ABI was x86_64-pc-windows-msvc, the directory will be build\x86_64-pc-windows-msvc\doc.
Notes
Since the Rust compiler is written in Rust, it must be built by a precompiled "snapshot" version of itself (made in an earlier stage of development). As such, source builds require an Internet connection to fetch snapshots, and an OS that can execute the available snapshot binaries.
See https://doc.rust-lang.org/nightly/rustc/platform-support.html for a list of supported platforms. Only "host tools" platforms have a pre-compiled snapshot binary available; to compile for a platform without host tools you must cross-compile.
You may find that other platforms work, but these are our officially supported build environments that are most likely to work.
Getting Help
See https://www.rust-lang.org/community for a list of chat platforms and forums.
Contributing
See CONTRIBUTING.md.
License
Rust is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0), with portions covered by various BSD-like licenses.
See LICENSE-APACHE, LICENSE-MIT, and COPYRIGHT for details.
Trademark
The Rust Foundation owns and protects the Rust and Cargo trademarks and logos (the "Rust Trademarks").
If you want to use these names or brands, please read the media guide.
Third-party logos may be subject to third-party copyrights and trademarks. See Licenses for details.