
Pijul (and Sanakirja) on the mainframe

Tuesday, February 13, 2024
By Pierre-Étienne Meunier

An underappreciated aspect of Pijul is its backend, Sanakirja, which is easy to mistake for a mere key-value store. Sanakirja does include a key-value store, but it is primarily a transactional block allocator in a file.

Don’t load your datastructures

The best feature of Sanakirja is that it blurs the distinction between storage and operations, removing the need to load entire serialized “models” from disk and write them back every time they are used. Such models cause either I/O congestion, and hence massive slowdowns on each operation, or huge memory requirements for even modest loads.

While this is probably alright for some workflows and may sometimes even be desirable (for datastructure-as-a-service companies, for example), it was out of the question for Pijul: since our initial goal was to fix Darcs’ performance issues, loading the entire history for each operation was not an option. Operating directly on disk is not new, but most libraries that do it today restrict themselves to a single datastructure (B trees), or possibly one or two more, whereas the potential for other datastructures is far greater, especially once these structures are made composable.

This is how Sanakirja was born: not just a key-value store, but really a generic ACID allocator in a file, usable to build more and more datastructures. And this actually worked: over the years, I’ve found myself building a number of classical datastructures (such as Ropes, R trees, Radix trees) on top of Sanakirja, as well as more modern (some would say fashionable) datastructures.
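
To give a feel for the API, here is a minimal sketch of the key-value layer built on top of the allocator, modelled on the examples in the sanakirja crate documentation (exact signatures may vary between versions):

use sanakirja::*;

fn main() {
    // An anonymous (memory-backed) environment with 2 root pages;
    // Env::new("path", ...) maps a file instead.
    let env = Env::new_anon(1 << 20, 2).unwrap();
    let mut txn = Env::mut_txn_begin(&env).unwrap();

    // The B tree only ever exists as pages inside the allocator,
    // identified by its root page; nothing is deserialized beyond
    // the pages actually touched by the transaction.
    let mut db = btree::create_db::<_, u64, u64>(&mut txn).unwrap();
    for i in 0..100u64 {
        btree::put(&mut txn, &mut db, &i, &(i * i)).unwrap();
    }

    // Remember the tree's root page, then commit atomically.
    txn.set_root(0, db.db);
    txn.commit().unwrap();
}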

The main downside is that the Sanakirja API is hard to read and use (please help us if you have ideas for improvements!), essentially because of the distance between on-disk data and the Rust memory model. Another issue is that writing Sanakirja is about as hard as writing an entire memory allocator, which is probably the most unsafe thing one can do in a programming language. One language feature that could have helped is a “flavoured” unsafe keyword that keeps track of the hypotheses: we would give our hypotheses names when marking a function unsafe (unsafe p_is_not_null fn deref(p: *mut T)), and “clear” them in blocks (unsafe p_is_not_null, page_is_uniquely_referenced { … }). This could even be made backwards-compatible: a plain unsafe {} block would clear all possible hypotheses, while the bare unsafe keyword on a function declaration would make the function unsafe for a “root hypothesis” that can only be cleared by a generic unsafe {} block.

Anyway… this is more or less what I ended up doing, except very manually, by pedantically commenting every single line of each non-trivial function of the sanakirja-core crate.
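
In practice, that convention looks something like the following sketch (the function and hypothesis names here are made up for illustration; this is not the actual sanakirja-core code):

// Illustrative sketch only: hypothetical names, not actual sanakirja-core code.

/// Read a little-endian u64 stored at `off` in a page.
///
/// SAFETY hypotheses:
/// - page_is_mapped: `page` points into a currently mapped region.
/// - off_in_bounds: `off + 8` does not exceed the page size.
unsafe fn read_offset(page: *const u8, off: usize) -> u64 {
    // page_is_mapped, off_in_bounds: the 8 bytes read here are inside
    // the mapped page, so the pointer arithmetic and the read are valid.
    let p = page.add(off) as *const u64;
    // On-disk integers are little-endian, whatever the host is.
    u64::from_le(p.read_unaligned())
}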

To the mainframe

For a few years now, Pijul has been relentlessly tested by @tankf33der, who has set out to run it in the most unusual situations and under the craziest workflows. Since having no “forbidden workflow” and no “bad practice” is definitely one of the core goals of this entire project (sorry, local Git gurus!), we couldn’t hope for a better testing strategy. As bugs were getting harder to find, tankf33der rightfully decided to challenge my claim that Sanakirja was designed to be robust to endianness changes. That didn’t go too well, at least initially:

[Figure: bigendian.png]

Unlike many other key-value stores and binary formats, we want Pijul not only to work on big-endian computers in mostly the same way as on little-endian ones, but also to keep its on-disk format stable across machines: one should be able to work from the same USB stick plugged into machines of different endiannesses.

The design for that feature is actually really simple: Sanakirja sometimes has to read integers from the file, for example lengths or offsets. Every time it does so, it assumes the values are written in little-endian. The downside is that testing is harder; the upside is that we save a few instructions on the most common architectures on which code is edited (even when compiling for more exotic targets).
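
Concretely, the idea is just that every integer crossing the disk boundary goes through an explicit little-endian conversion, which costs nothing on little-endian hosts and a byte swap on big-endian ones. A minimal sketch (not the actual sanakirja code):

// Sketch: fixed little-endian on-disk integers, regardless of host
// endianness. Not the actual sanakirja code.

/// Read a length or offset stored at `pos` in a page buffer.
fn read_u64_le(page: &[u8], pos: usize) -> u64 {
    let bytes: [u8; 8] = page[pos..pos + 8].try_into().unwrap();
    // On little-endian hosts this compiles to a plain load;
    // on big-endian hosts it inserts a byte swap.
    u64::from_le_bytes(bytes)
}

/// Write the value back in the same fixed byte order.
fn write_u64_le(page: &mut [u8], pos: usize, value: u64) {
    page[pos..pos + 8].copy_from_slice(&value.to_le_bytes());
}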

In the end the issue in Sanakirja was quite hard to find, but the fix was extremely easy. Sanakirja has four types of B tree nodes (internal nodes and leaves have different designs, and so do fixed-width key/value pairs versus variable-width ones). I decided to plot, for each of the 500+ steps of a massive “record/unrecord” sequence crafted by tankf33der, the trees that had gone wrong, using Graphviz. This revealed that immediately after an internal node split in a B tree with fixed-width entries, the leftmost entry of the resulting right-hand child referenced a page far outside the file. There we were! One property that helped the debugging is that offsets in the file are 64-bit integers, which, when read in the wrong endianness, yield absurdly large values.
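
For the curious, dumping a tree for Graphviz can be as simple as walking the pages and printing DOT edges. A hypothetical sketch, where the `children` closure stands in for real page parsing (which depends on the node type):

// Sketch: dump a B tree as Graphviz DOT to spot corrupt page references.
// `children` is a stand-in for parsing a node and listing its child pages.

fn dump_dot(root: u64, file_len: u64, children: impl Fn(u64) -> Vec<u64>) {
    println!("digraph btree {{");
    let mut stack = vec![root];
    while let Some(page) = stack.pop() {
        for child in children(page) {
            println!("  p{} -> p{};", page, child);
            if child >= file_len {
                // An offset pointing outside the file: exactly the bug here.
                println!("  p{} [color=red];", child);
            } else {
                stack.push(child);
            }
        }
    }
    println!("}}");
}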

Your own garage mainframe

The initial effort of finding a useful and interesting architecture to test on was entirely tankf33der’s: QEMU emulating s390x. While this is really cool, it still takes quite some time to set up. We then simplified that setup a little, at least on NixOS (or Linux+Nix), using the following shell.nix:

with import <nixpkgs> {
  overlays = map (uri: import (fetchTarball uri)) [
    "https://github.com/mozilla/nixpkgs-mozilla/archive/master.tar.gz"
  ];
};

let s390x = import <nixpkgs> { crossSystem = { config = "s390x-unknown-linux-gnu"; }; };
in

clangStdenv.mkDerivation rec {
  name = "s390x";
  buildInputs = [
    s390x.zstd
    s390x.libsodium
    s390x.openssl
    s390x.libiconv
    s390x.xxHash
    s390x.pkg-config
  ];
  nativeBuildInputs = [
    pkg-config
    s390x.stdenv.cc
    ((pkgs.rustChannelOf {
      channel = "stable";
    }).rust.override {
      targets = [
        "x86_64-unknown-linux-gnu"
        "s390x-unknown-linux-gnu"
      ];
    })
  ];

  CFLAGS="-I${glibc.dev}/include";
  CFLAGS_s390x-unknown-linux-gnu="-I${s390x.glibc.dev}/include";
  RUSTFLAGS="-L${glibc}/lib";
  CARGO_TARGET_S390X_UNKNOWN_LINUX_GNU_RUSTFLAGS="-L${s390x.glibc}/lib";
}

Then, Cargo needs to be told which linker to use for the target, which can be done by adding the following to .cargo/config at the root of the Pijul repository (or to ~/.cargo/config):

[target.s390x-unknown-linux-gnu]
linker = "s390x-unknown-linux-gnu-cc"

Finally, if you are running Linux with Nix (or NixOS, and possibly macOS with Nix, though I haven’t tried), you can just do the following (with QEMU installed):

nix-shell --run "cargo build --target s390x-unknown-linux-gnu --release" shell.nix

qemu-s390x target/s390x-unknown-linux-gnu/release/pijul

This runs Pijul under QEMU user-mode emulation of the s390x architecture, using your system’s native system calls (I’ve only tested this on Linux).

