Modern perfect hashing for strings

source: http://0x80.pl/notesen/2023-04-30-lookup-in-strings.html
Constructing perfect hashing function

Short recap:

  • A collision of hashes occurs when a hash function yields the same value for two different keys.
  • A perfect hash function is a hash function that does not yield any collisions for the given set of keys.
  • A minimal perfect hash function (MPH) is a perfect hash function that maps the keys into the range from 0 to N-1, where N is the number of keys.

While minimal perfect hash functions are hard to find, perfect hash functions are quite easy to compute. We will describe two methods:

  • using open addressing,
  • using the minimum number of bits.

Open addressing

When we build a generic hash table implementation, collisions are unavoidable. The most common solution to this problem is using some kind of container (a linked list or balanced tree) to keep all objects sharing the same hash value.

A less common strategy is open addressing. We store values directly in the table. If there is a collision — that is, the given index is already allocated — we try another index. There are several approaches to pick a new index; it may simply be the next index (called linear probing), but there are of course more sophisticated procedures (quadratic probing, double hashing, etc.).
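
For background, here is a minimal Go sketch of a lookup with linear probing. It is an illustration of the generic idea only, not the exact scheme built later in this note (which merely bounds the number of colliding keys per slot).

// probeLookup is a generic illustration of linear probing: on a collision
// we simply try the next slot. An empty string marks a free slot.
func probeLookup(table []string, hash func(string) uint64, key string) bool {
    n := uint64(len(table))
    start := hash(key) % n
    for i := uint64(0); i < n; i++ {
        slot := table[(start+i)%n]
        if slot == "" {
            return false // free slot: the key is not present
        }
        if slot == key {
            return true
        }
    }
    return false
}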

In our case, where we have a static set of strings, we use the simplest solution. Our goal is to find the minimum table size (N) for which the maximum number of keys colliding in one slot (k) is as small as possible, preferably with no collisions at all.

The input parameters are: 1) a set of strings and 2) a hash function; to avoid calling the hash function over and over, we cache the hash value for each string.

The main procedure is responsible for setting the maximum number of collisions; it starts from 1 and increases it until we succeed. The number of collisions depends on the quality of the hash function, but often it is just 2 or 3. The Go implementation of the procedure is shown below.

Then, for some range of table sizes, we check the actual number of collisions, and we report the minimum table size.

func computeHashParameters(keywords []Keyword, hash func([]byte) uint64) (size int, collisions int) {
    hashes := make([]uint64, len(keywords))
    for i := range keywords {
        hashes[i] = hash(keywords[i].word)
    }
    collisions = 1
    for {
        size = hashtableaux(hashes, collisions)
        if size > 0 {
            fmt.Printf("size = %5d, maxcollisions = %d\n", size, collisions)
            return
        }
        collisions += 1
    }
}

Below is the implementation of a helper function. It checks table sizes starting from the size of the input set up to ten times that size; a multiplier of 10 turned out to be a good upper bound.

func hashtableaux(hashes []uint64, maxcollisions int) int {
    n := len(hashes)
    const sizeOverhead = 10

outer:
    for N := n; N < sizeOverhead*n; N++ {
        table := make(map[uint64]int)
        for _, h := range hashes {
            idx := h % uint64(N)
            table[idx] += 1
            if table[idx] > maxcollisions {
                continue outer
            }
        }
        return N
    }

    return -1
}
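
The snippets above assume a Keyword type and a hash function passed as an argument; neither is shown in this section, so below is a minimal Go sketch of both. The field names and the Go variant of hash1 (the sum of the first and last byte multiplied by the length, as described later) are my assumptions, not code from the original sources.

type Keyword struct {
    word  []byte // the keyword itself
    value int    // the value returned by the lookup procedure
}

// hash1: the sum of the first and last byte, multiplied by the length.
// A Go counterpart of the C++ hash1 used below; a sketch only.
func hash1(s []byte) uint64 {
    f := uint64(s[0])
    l := uint64(s[len(s)-1])
    return (f + l) * uint64(len(s))
}

// Example: size, collisions := computeHashParameters(keywords, hash1)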

Now, having the parameters N and k, we can finally write a lookup procedure. The following C++ procedure looks up Go keywords.

int lookup_go_hash1(std::string_view s) {
    const uint64_t idx = (hash1(s) % 38) * 2;
    static std::string_view lookup[76] = {
        "chan", // 836 (0x344)
        "", "", "",
        "case", // 800 (0x320)
        "for", // 648 (0x288)
        "", "",
        "continue", // 1600 (0x640)
        "", "", "",
        "defer", // 1070 (0x42e)
        "func", // 804 (0x324)
        "", "", "", "",
        "package", // 1491 (0x5d3)
        "",
        "else", // 808 (0x328)
        "go", // 428 (0x1ac)
        "const", // 1075 (0x433)
        "range", // 1075 (0x433)
        "var", // 696 (0x2b8)
        "", "", "",
        "return", // 1344 (0x540)
        "", "", "", "", "",
        "map", // 663 (0x297)
        "",
        "select", // 1386 (0x56a)
        "struct", // 1386 (0x56a)
        "", "",
        "goto", // 856 (0x358)
        "", "", "",
        "switch", // 1314 (0x522)
        "", "", "",
        "fallthrough", // 2266 (0x8da)
        "", "", "", "", "", "", "", "", "", "", "",
        "default", // 1512 (0x5e8)
        "interface", // 1854 (0x73e)
        "", "",
        "type", // 868 (0x364)
        "", "", "",
        "if", // 414 (0x19e)
        "import", // 1326 (0x52e)
        "", "", "", "",
        "break", // 1025 (0x401)
        "", };
    static int values[76] = {
        2, // 836 (0x344)
        -1, -1, -1,
        1, // 800 (0x320)
        9, // 648 (0x288)
        -1, -1,
        4, // 1600 (0x640)
        -1, -1, -1,
        6, // 1070 (0x42e)
        10, // 804 (0x324)
        -1, -1, -1, -1,
        17, // 1491 (0x5d3)
        -1,
        7, // 808 (0x328)
        11, // 428 (0x1ac)
        3, // 1075 (0x433)
        18, // 1075 (0x433)
        24, // 696 (0x2b8)
        -1, -1, -1,
        19, // 1344 (0x540)
        -1, -1, -1, -1, -1,
        16, // 663 (0x297)
        -1,
        20, // 1386 (0x56a)
        21, // 1386 (0x56a)
        -1, -1,
        12, // 856 (0x358)
        -1, -1, -1,
        22, // 1314 (0x522)
        -1, -1, -1,
        8, // 2266 (0x8da)
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        5, // 1512 (0x5e8)
        15, // 1854 (0x73e)
        -1, -1,
        23, // 868 (0x364)
        -1, -1, -1,
        13, // 414 (0x19e)
        14, // 1326 (0x52e)
        -1, -1, -1, -1,
        0, // 1025 (0x401)
        -1,
    };
    for (int i=0; i < 2; i++) {
        if (lookup[idx + i] == s) {
            return values[idx + i];
        }
    }
    return -1;
}

First we call hash1 on the input string. Its result is taken modulo 38 — that is the value N found by the above algorithm. Likewise, the magic constant 2 is the parameter k, the number of collisions we have to handle.

The lookup procedure contains two auxiliary tables of size N * k = 76: one for strings and another for values; the not-found value is -1. (Note that we could use a single array of structures instead; it is hard to tell which is better for typically small arrays.) After computing the index idx we run the following check k = 2 times:

if (lookup[idx + i] == s) {
    return values[idx + i];
}

Please note that all hash table parameters are compile-time constants, thus the compiler can get rid of the division and apply additional optimizations.

Below is the assembly code of the procedure, compiled for the Skylake architecture with GCC 12.2.0.

<lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)>:
    push   %r14
    push   %r13
    mov    %rsi,%r13
    push   %r12
    push   %rbp
    mov    %rdi,%rbp
    push   %rbx

    movsbq (%rsi),%rax                  // f := first char
    movsbq -0x1(%rdi,%rsi,1),%rcx       // l := last char
    add    %rax,%rcx                    // their sum
    imul   %rdi,%rcx                    // (f + l) * s.size()

    movabs $0xd79435e50d79435f,%rax     //
    mul    %rcx                         //
    shr    $0x5,%rdx                    //
    imul   $0x26,%rdx,%rax              //
    sub    %rax,%rcx                    //
    mov    %rcx,%rdx                    //
    lea    (%rcx,%rcx,1),%r12           //
    shl    $0x5,%rdx                    // rdx = (((f + l) * s.size()) % 38) * 32

    // The for loop. GCC split the loop -- it quickly compares the length
    // of lookup[idx + i] with s.size() (rbp) and only if the lengths are
    // equal does it jump to the byte comparison.
    lea    0x267b47(%rip),%rax          // rax = lookup address
    lea    (%rdx,%rax,1),%rbx           // rbx = &lookup[idx]
    lea    0x2(%r12),%r14
    mov    0x8(%rbx),%rdi
    cmp    (%rbx),%rbp                  // jump to full compare only if lengths are equal
    je     <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)+0x78>
    inc    %r12
    add    $0x10,%rbx
    cmp    %r14,%r12
    jne    <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)+0x52>
    pop    %rbx
    pop    %rbp
    pop    %r12
    pop    %r13
    mov    $0xffffffff,%eax
    pop    %r14
    ret
    nopl   (%rax)

    // check if bytes are equal
    test   %rbp,%rbp
    je     <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)+0x8c>
    mov    %rbp,%rdx
    mov    %r13,%rsi
    call   <memcmp@plt>
    test   %eax,%eax
    jne    <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)+0x5b>
    pop    %rbx
    pop    %rbp
    lea    0x1f56bb(%rip),%rax        # <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)::values>
    mov    (%rax,%r12,4),%eax
    pop    %r12
    pop    %r13
    pop    %r14
    ret

Hash functions

The following hash functions were tested (hedged Go sketches of a few of them follow the list):

  • hash1 — sum of the first and last char multiplied by the string length;
  • hash2 — sum of the first and last char bit-xor'ed with the string length;
  • hash3 — the first, second and last characters combined with the string length: ((f << 8) | s) + l + (n * 0x111);
  • hash_sum0 — sum of all bytes;
  • hash_sumN — sum of all bytes plus the string length;
  • hash_djb2 — Bernstein's hash;
  • hash_sdb — sdbm hash.
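
Below are hedged Go sketches of a few of the functions listed above. They follow the descriptions given in the list; the exact variants used in the original benchmark code may differ in details (constants, add vs. xor variant of djb2, and so on).

// hash2: the sum of the first and last byte, xor-ed with the length.
func hash2(s []byte) uint64 {
    f := uint64(s[0])
    l := uint64(s[len(s)-1])
    return (f + l) ^ uint64(len(s))
}

// hashSumN: the sum of all bytes plus the string length (hash_sumN above).
func hashSumN(s []byte) uint64 {
    h := uint64(len(s))
    for _, c := range s {
        h += uint64(c)
    }
    return h
}

// hashDJB2: the classic Bernstein hash, h = h*33 + c, starting from 5381
// (hash_djb2 above).
func hashDJB2(s []byte) uint64 {
    h := uint64(5381)
    for _, c := range s {
        h = h*33 + uint64(c)
    }
    return h
}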

Splitting by input length

I observed that the string length is a decent discriminator (see the note "STL: map with string as key — access speedup"). In most cases we obtain the length of a string in constant time; for instance, this C++ function compiles down to a few CPU instructions.

#include <cstdint>
#include <string_view>

uint64_t len(std::string_view s) {
   return s.size();
}
$ g++ -O2 -S test.cpp
$ cat test.s
    ...
    movq    %rdi, %rax
    ret
    ...

Taking this into account, we first split the set of strings into subsets of strings having the same length. These subsets are usually small, and we can fine-tune the lookup in each subset independently. One common optimization present in the sample code is using a plain if-ladder of comparisons when a subset is small (has two or three elements).

Additionally, working with same-length strings enables more low-level optimizations:

  • The length of the string is usually known at compile time.
  • Thus, bounds checks can be eliminated.
  • Checking equality of two strings from a subset is a single memcmp, and this call may be further inlined and optimized by the compiler.

Using the minimum number of bits

Note: this method was inspired by the hashing procedure found in GNU gperf.

The presented method requires the input strings to have the same length. We guarantee that property by pre-classifying the input strings by length, as described in the previous section.

gperf uses the following observation: not all input characters contribute to the uniqueness of the set. In most cases we may discard many input characters and the modified set still consists of unique items.

For instance, the five-letter Go keywords are break, const, defer, and range. The last character is sufficient to distinguish them:

break => ____k
const => ____t
defer => ____r
range => ____e

In the case of six-letter keywords, we need two characters, as shown below. The letter 't' repeats at the last position, but together with the other selected character the set becomes unique.

import => ___o_t
return => ___u_n
select => ___e_t
struct => ___u_t
switch => ___t_h

The major problem is that we somehow need to combine these unique characters into a unique and relatively small value. gperf adds the characters after passing them through an auxiliary lookup table, like: lookup[s[0]] + lookup[s[5]].

Similarly to discarding individual bytes, we may discard individual bits, ending up with a bitmask that can be used to extract the significant bits. To do this, we use the PEXT instruction, which is very fast.
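
PEXT (parallel bits extract, part of the BMI2 extension) gathers the bits of a word selected by a mask and packs them into the least significant bits of the result. Below is a Go sketch of its semantics, for illustration only; the actual lookup code uses the _pext_u64 intrinsic.

// pext packs the bits of w selected by mask into the low bits of the
// result, which is the semantics of the PEXT instruction from BMI2.
// For example, pext(uint64('t'), 0x14) == 3, as in the five-letter
// keyword case shown later.
func pext(w, mask uint64) uint64 {
    var result uint64
    k := uint(0)
    for i := uint(0); i < 64; i++ {
        if mask&(1<<i) != 0 {
            if w&(1<<i) != 0 {
                result |= 1 << k
            }
            k++
        }
    }
    return result
}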

Before we dig into details, let us re-examine the examples. For five-letter Go keywords we have:

01100010.01110010.01100101.01100001.01101011 => ________.________.________.________.______11
01100011.01101111.01101110.01110011.01110100 => ________.________.________.________.______00
01100100.01100101.01100110.01100101.01110010 => ________.________.________.________.______10
01110010.01100001.01101110.01100111.01100101 => ________.________.________.________.______01

And for six-letter ones:

01101001.01101101.01110000.01101111.01110010.01110100 => ________.________.________.___0____._______0._____1__
01110010.01100101.01110100.01110101.01110010.01101110 => ________.________.________.___1____._______0._____1__
01110011.01100101.01101100.01100101.01100011.01110100 => ________.________.________.___0____._______1._____1__
01110011.01110100.01110010.01110101.01100011.01110100 => ________.________.________.___1____._______1._____1__
01110011.01110111.01101001.01110100.01100011.01101000 => ________.________.________.___1____._______1._____0__

Thus, in the first case we need just 2 bits, and in the second case only 3 bits. The 2-bit subset contains all possible values (00, 01, 11, 10), while the 3-bit subset is not full (001, 101, 011, 111, 110).

A skeleton of a C++ procedure that uses this approach is:

...
/* 1 */ const uint64_t word = load_bytes(s);
/* 2 */ const uint64_t idx  = _pext_u64(word, mask);
/* 3 */ if (memcmp(lookup[idx], s.data(), size) == 0) {
            return value[idx];
        }
...

Once we know which bits are significant, we need to find which bytes have to be loaded (1); a byte is the minimum amount of data we can transfer. As we can see, for five-letter keywords it is sufficient to load just one byte. But for six-letter keywords we need three bytes — in practice it is still a single load, but done on a 32-bit entity.

Then, we extract the bits using the instruction PEXT (2) — these bits form an n-bit index.

We need auxiliary tables of size 2^n, having exactly the same meaning as in hashing. We compare the input string with the keyword (3) and, if they are equal, we return the associated value.
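
Going back to step (1): on the generator side, the byte range that has to be loaded can be derived from the per-byte mask. Below is a minimal Go helper for this; the name and signature are mine, not taken from the original generator.

// loadRange returns the indices of the first and last mask byte that
// contain any significant bits; the generated lookup code then issues
// the smallest 8-, 16-, 32- or 64-bit load that covers this range.
func loadRange(mask []byte) (first, last int, ok bool) {
    first, last = -1, -1
    for i, m := range mask {
        if m == 0 {
            continue
        }
        if first < 0 {
            first = i
        }
        last = i
    }
    return first, last, first >= 0
}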

This is a snippet from the procedure for looking up Go keywords:

int lookup_go_pext(std::string_view s) {
    switch (s.size()) {
        // ...
        case 5: {
            static char lookup[4][5] = {
                {'b', 'r', 'e', 'a', 'k'},
                {'r', 'a', 'n', 'g', 'e'},
                {'d', 'e', 'f', 'e', 'r'},
                {'c', 'o', 'n', 's', 't'},
            };
            static int value[4] = {
                0,
                18,
                6,
                3,
            };
            const uint8_t w2 = s[4];
            const uint64_t idx = _pext_u64(w2, 0x14);
            if (memcmp(lookup[idx], s.data(), sizeof(lookup[0])) == 0) {
                return value[idx];
            }
        }
        break;
        case 6: {
            static char lookup[8][6] = {
                {}, // no match
                {'s', 'w', 'i', 't', 'c', 'h'},
                {}, // no match
                {'r', 'e', 't', 'u', 'r', 'n'},
                {'s', 'e', 'l', 'e', 'c', 't'},
                {'s', 't', 'r', 'u', 'c', 't'},
                {'i', 'm', 'p', 'o', 'r', 't'},
                {}, // no match
            };
            static int value[8] = {
                -1,
                22,
                -1,
                19,
                20,
                21,
                14,
                -1,
            };
            uint32_t w3 = 0;
            memcpy(&w3, &s[2], 4);
            const uint64_t idx = _pext_u64(w3, 0x10101000);
            if (memcmp(lookup[idx], s.data(), sizeof(lookup[0])) == 0) {
                return value[idx];
            }
        }
        break;
        // ...
    }
    return -1;
}

Variable-length set of words

Having strings of the same length is not a strict requirement. We can always fill the past-the-end characters with some easy-to-compute values. The following values were tested while writing this text:

  • zero (0x00),
  • the first character,
  • the last character,
  • the input length,
  • the first/last character combined with the input length.

Depending on the chosen method and string set, the outcome differs. Including the length and one of the characters seems to help. However, pre-classifying inputs by their length turned out to be easier and faster, thus this topic is not covered here.

Finding mask

Finding the mask is surprisingly straightforward. We start with a full mask and iterate over all its bits. For each bit, we clear it and then check whether the set of masked words is still unique. If so, we keep that bit cleared; otherwise, we revert the change.

Two procedures that compute the mask are shown below.

func computeMask(words []Keyword) (mask []byte, err error) {
    max := 0
    for _, w := range words {
        if n := len(w.word); n > max {
            max = n
        }
    }

    mask = make([]byte, max)
    for i := range mask {
        mask[i] = 0xff
    }

    ok := checksamesize(words, mask)
    if !ok {
        err = fmt.Errorf("set of words is not unique")
        return
    }

    for byteIdx := 0; byteIdx < max; byteIdx++ {
        for bitIdx := 0; bitIdx < 8; bitIdx++ {
            old := mask[byteIdx]
            mask[byteIdx] &= ^(1 << bitIdx)
            if !checksamesize(words, mask) {
                mask[byteIdx] = old
            }
        }
    }

    return
}

func checksamesize(words []Keyword, mask []byte) bool {
    M := make(map[string]struct{})
    word := make([]byte, len(mask))
    for i := range words {
        for j := range mask {
            word[j] = words[i].word[j] & mask[j]
        }

        s := string(word)
        _, ok := M[s]
        if ok {
            return false
        }
        M[s] = struct{}{}
    }

    return true
}
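
A minimal usage sketch follows, assuming the Keyword type sketched earlier. Since the loop clears the lowest bits first, for the five-letter Go keywords the procedure should end up with 0x14 in the last mask byte, the very constant used in lookup_go_pext above.

// Example (sketch): compute the mask for the five-letter Go keywords.
func exampleMaskFiveLetter() {
    keywords := []Keyword{
        {word: []byte("break")},
        {word: []byte("const")},
        {word: []byte("defer")},
        {word: []byte("range")},
    }
    mask, err := computeMask(keywords)
    if err != nil {
        panic(err)
    }
    fmt.Printf("mask = % x\n", mask) // expected: 00 00 00 00 14
}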

Loads

As shown in the examples above, a single load is often sufficient, as the significant bits fit in an 8-, 16-, 32-, or 64-bit word.

However, sometimes more loads are needed. In the generic case we concatenate the results of individual loads, forming a 32- or 64-bit word. Below is an example of such code, where we combine 8-bit and 16-bit input words into a single 32-bit word, which is the final argument to PEXT. Note that we only need to concatenate the values at runtime; the mask for PEXT is known at compile time.

uint32_t w1;
uint16_t w2 = 0;
memcpy(&w2, &s[0], 2);
w1 = uint32_t(w2);
const uint8_t w3 = s[2];
w1 |= (uint32_t(w3) << 16);
const uint64_t idx = _pext_u64(w1, 0x1e1104);

Another possibility arises when we have exactly two loads and their masks do not collide, i.e. their bitwise AND is zero. Then we can merge these two words; sample code for such a case is shown below.

uint16_t w7 = 0;
memcpy(&w7, &s[10], 2);
const uint8_t w8 = s[2];
const uint16_t w9 = (uint16_t(w7) & 0x410) | (uint16_t(w8) & 0x4);
const uint64_t idx = _pext_u64(w9, 0x414);
