
Scaling Rust builds with Bazel


The CI saga

Over the project's life, we used various tools to mitigate cargo's limitations, with mixed success.

The nix days

When we started the Rust implementation in mid-2019, we relied on nix to build all our software and set up the development environment in a cross-platform way (we develop both on macOS and Linux).

As our code base grew, we started to feel nix's limitations. The unit of caching in nix is a derivation. If we wanted to take full advantage of nix's caching capabilities, we would have to "nixify" all our external dependencies and internal Rust packages (one derivation per Rust package). After a long fight with build reproducibility issues, our glorious dev-infra team implemented fine-grained caching using the cargo2nix project.

Unfortunately, most developers in the team were uncomfortable with nix. It became a constant source of confusion and lost developer productivity. Since nix has a steep learning curve, only a few nix wizards could understand and modify the build rules. This nix-alienation bifurcated our build environment: the CI servers built the code with nix-build, and developers built the code by entering the nix-shell and invoking cargo.

The iceberg

The final blow to the nix story came around late 2020, close to the network launch. Our security team chose Ubuntu as the deployment target and insisted that production binaries link against the regularly updated system libraries (libc, libc++, openssl, etc.) that the deployment platform provides. This setup is hard to achieve in nix without compromising correctness. (We considered using patchelf, but it's a bad idea in general: libc++ from nix packages can be incompatible with the one installed on the deployment platform.)

Furthermore, the infrastructure team got a few new members unfamiliar with nix and decided to switch to a more familiar technology, Docker containers. The team implemented a new build system that runs cargo builds inside a docker container with the versions of dynamic libraries identical to those in the production environment.

The new system grew organically and eventually evolved into a hot mess of a hundred GitLab YAML configuration files calling shell and Python scripts in the correct order. These scripts used well-known filesystem locations and environment variables to pass build artifacts around. Most integration tests ended up as shell scripts that expected inputs produced by the CI pipeline.

The new Docker-based build system lost the granular caching capabilities of nix-build. The infra team attempted to build a custom caching system but eventually abandoned the project. Cache invalidation is a challenging problem indeed.

With the new system, the chasm between the CI and development environments deepened further because the nix-shell didn't go anywhere. The developers continued to use nix-shell for everyday development. It's hard to pinpoint the exact reason. I attribute that to the fact that entering the nix-shell is less invasive than entering a docker container, and nix-shell does not require running in a virtual machine on macOS (Rust compile times are slow). Also, the infra team was so busy rewriting the build system that improving the everyday developer experience was out of reach.

I call this setup an "iceberg": on the surface, a developer needed only nix and cargo to work on the code, but in practice, that was only 10% of the story. Since most tests required a CI environment, developers had to create merge requests to check whether their code worked beyond the basic unit tests. The CI didn't know developers were interested in running a specific test and executed the entire test suite, wasting scarce computing resources and slowing the development cycle.

The tests accumulated over time, the load on the CI system grew, and eventually, the builds became unbearably slow and flaky. It was time for another change.

Enter Bazel

Among about a dozen build systems I have worked with, Bazel is the only one that made sense to me. (It might also be that I never learned to do anything without involving protocol buffers.) One of my favorite features of Bazel is how explicit and intuitive it is for everyday use.

Bazel is like a good videogame: easy to learn, challenging to master. It's easy to define and wire build targets (that's what most engineers do), but adding new build rules requires some expertise. Every engineer at Google can write correct build files without knowing much about Blaze (Google's internal variant of Bazel). The build files are verbose, bordering on plain boring, but that's a good thing: they tell the reader precisely what the module's artifacts and dependencies are.
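
To give a flavor of what this looks like in practice, here is a minimal sketch of a BUILD file using rules_rust; the target names and package paths are made up for illustration, not taken from our repository.

    load("@rules_rust//rust:defs.bzl", "rust_library", "rust_test")

    # A library target: sources and dependencies are declared explicitly.
    rust_library(
        name = "replica",
        srcs = glob(["src/**/*.rs"]),
        deps = [
            "//rs/types",            # an internal package (hypothetical path)
            "@crate_index//:serde",  # an external crate from a dependency index
        ],
    )

    # Unit tests for the library above; Bazel caches their results.
    rust_test(
        name = "replica_test",
        crate = ":replica",
    )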

Bazel offers many features, but we mostly cared about the following:

  • Extensibility. Bazel is extensible enough to cover all our use cases and gracefully handled everything we threw at it: Linux and macOS binaries, WebAssembly programs, OS images, Docker containers, Motoko programs, TLA+ specifications, etc. The best part: we can combine and mix these artifacts in any way we like.
  • Aggressive caching. The sandboxing feature ensures that build actions do not use undeclared dependencies, making it much safer to cache build artifacts and, most importantly for us, test results.
  • Remote caching. We use the cache from our CI system to speed up developer builds.
  • Distributed builds. Bazel can distribute tasks across multiple machines to finish builds even faster.
  • Visibility control. Bazel allows package authors to mark some packages as internal to prevent other teams from importing the code (see the sketch after this list). Controlling dependency graphs is crucial for fast builds.
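
For illustration, here is a sketch of what such a visibility declaration might look like (the package paths are hypothetical):

    load("@rules_rust//rust:defs.bzl", "rust_library")

    rust_library(
        name = "internal_utils",
        srcs = glob(["src/**/*.rs"]),
        # Only packages under //rs/replica may depend on this target;
        # everyone else gets a build error instead of a new dependency edge.
        visibility = ["//rs/replica:__subpackages__"],
    )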

Even more importantly, Bazel unifies our development and CI environments. All our tests are Bazel tests now, meaning that every developer can run any test locally. At its heart, our CI job is bazel test --config=ci //....
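
The --config=ci part selects a named group of options from the repository's .bazelrc. A plausible sketch of such a config (the cache endpoint is a placeholder, not our real setup):

    # .bazelrc
    build:ci --remote_cache=https://cache.example.com   # placeholder cache endpoint
    build:ci --keep_going                                # report all failures, not just the first
    test:ci --test_output=errors                         # print logs only for failing tests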

One nice feature of our Bazel setup is that we can configure versions of our external dependencies in a single file. Ironically, cargo developers implemented support for workspace dependency inheritance a few weeks after we finished the migration.
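
As a rough sketch of that idea, rules_rust's crate_universe lets a WORKSPACE declare every external crate and its version in one place (a simplified sketch, not necessarily our exact setup; the crates and versions below are illustrative):

    load("@rules_rust//crate_universe:defs.bzl", "crate", "crates_repository")

    crates_repository(
        name = "crate_index",
        cargo_lockfile = "//:Cargo.lock",
        packages = {
            # Bumping a version here updates it for the entire repository.
            "serde": crate.spec(version = "1.0", features = ["derive"]),
            "tokio": crate.spec(version = "1", features = ["full"]),
        },
    )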

