
Ask HN: What are some cool but obscure data structures you know about?

source link: https://news.ycombinator.com/item?id=32186203

537 points by Uptrenda | 272 comments
I'm very interested in what types of interesting data structures are out there HN. Totally your preference.

I'll start: bloom filters. They let you test whether a value is definitely NOT in a set of pre-stored values (or POSSIBLY in the set - with an adjustable false-positive probability that trades off against how much storage the filter needs).

Good use-case: routing. Say you have a list of 1 million IPs that are blacklisted. A trivial algorithm would be to compare a given IP against every element of the set, so the time complexity grows with the number of elements. Not so with a bloom filter! A bloom filter is one of the few data structures whose query time does not grow with the number of elements, because the 'keys' never need to be stored ('search' and 'insert' cost depends only on the number of hash functions).
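
A rough Python sketch of the idea (the bit-array size and hash count here are arbitrary placeholders, not tuned for a real false-positive target):

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 20, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            # Derive k bit positions from double hashing of the item.
            digest = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(digest[:8], "little")
            h2 = int.from_bytes(digest[8:16], "little")
            for i in range(self.num_hashes):
                yield (h1 + i * h2) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # False means definitely absent; True means possibly present.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    blacklist = BloomFilter()
    blacklist.add("203.0.113.7")
    print("203.0.113.7" in blacklist)   # True (possibly present)
    print("198.51.100.1" in blacklist)  # False (definitely absent)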

Bonus section: Golomb Coded Sets are similar to bloom filters but need much less storage space, at the cost of slower queries.

HAMT: Hash Array Mapped Trie. This data structure makes efficient immutable data possible. You can update a list of a million items, and keep a reference to the original list, by changing 3 or 4 references and some bytes.

This should replace copy-on-write for scripting languages. I really want to see it in a JS spec soon. There are libraries that can do it, but they add translation penalties and extra steps. I’d complain less if I could use Clojurescript professionally, because it and Clojure are built around them.

https://en.m.wikipedia.org/wiki/Hash_array_mapped_trie
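
The path-copying trick behind HAMTs can be sketched without the hashing part. Here's a rough, hypothetical Python sketch of a persistent 32-way trie (the same node layout Clojure's persistent vectors use): an update copies only the few nodes along one path and shares every other subtree with the original.

    BITS = 5                  # 32-way branching, as in HAMTs / Clojure vectors
    WIDTH = 1 << BITS

    def build(values, level):
        # Pack values into a tree of 32-wide nodes (no hashing, just indexing).
        if level == 0:
            return list(values)
        step = 1 << level
        return [build(values[i:i + step], level - BITS)
                for i in range(0, len(values), step)]

    def get(node, index, level):
        while level > 0:
            node = node[(index >> level) & (WIDTH - 1)]
            level -= BITS
        return node[index & (WIDTH - 1)]

    def assoc(node, index, value, level):
        copy = list(node)            # copy one 32-slot node, share the rest
        if level == 0:
            copy[index & (WIDTH - 1)] = value
        else:
            sub = (index >> level) & (WIDTH - 1)
            copy[sub] = assoc(node[sub], index, value, level - BITS)
        return copy

    LEVEL = 10                       # three levels of 32-wide nodes = 32768 slots
    original = build(list(range(32768)), LEVEL)
    updated = assoc(original, 12345, "changed", LEVEL)   # copies just 3 nodes
    print(get(original, 12345, LEVEL), get(updated, 12345, LEVEL))  # 12345 changed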

Another interesting approach to copy-on-write for immutable collections (for example arrays) is where you actively mutate the array in place, but leave the original as a lazy description/view of how to get back to the before state.

From the outside the effect is the same, but performance is optimized for accessing the updated collection; the old version is only reconstructed if it is actually needed.

Great for cases where you want the immutability guarantee, but where the old value is unlikely to actually be used in the common case.

Isn't this what Sublime Text uses under the hood to give it such good performance?
If the Record & Tuple proposal advances to stage 3 we'll finally have native immutable data structures in JS [1].

[1] https://github.com/tc39/proposal-record-tuple

That’s what Immutable.js used under the hood.
Yes, I evaluated it. The complexity of knowing where to convert from/to plain JS, plus the extra library syntax to learn, plus the performance cost of toJS, made it a poor fit for my particular use case. Nearly as much of a hard sell at work as saying “let’s rebuild the UI in Clojurescript,” and without providing as much benefit.

My use case is pretty atypical though, and it’s worth checking out if you have more reliable data shapes/update paths.

The union-find data structure / algorithm is useful and a lot of fun.

The goal is a data structure where you can perform operations like "a and b are in the same set", "b and c are in the same set" and then get answers to questions like "are a and c in the same set?" (yes, in this example.)

The implementation starts out pretty obvious - a tree where every element either points at itself or at something it was merged with. To check if two elements are in the same set, check if they have the same root. Without analyzing it, it sounds like you'll average findRoot() performance of O(log(n)), worst-case O(n).

There are a couple of simple optimizations you can do to this structure, the type of things that seem like they shouldn't end up affecting asymptotic runtime all that much. The first is that, whenever you find a root, you can re-parent all the nodes you visited on the way to that root, so they'll all be quicker to look up next time. The other is that you keep track of the size of sets, and always make the larger set be the parent of the smaller.

And neither of those actually does anything impressive alone, but if you use both, the algorithm suddenly becomes incredibly fast, with the slowest-growing (non-constant) complexity I've ever heard of: O(inverse Ackermann(n)). Or, for any reasonable n, effectively 4 or less.
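
A minimal Python sketch with both optimizations (path compression plus union by size):

    class DisjointSet:
        def __init__(self, n):
            self.parent = list(range(n))
            self.size = [1] * n

        def find(self, x):
            # Path compression: point every visited node directly at the root.
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            while self.parent[x] != root:
                self.parent[x], x = root, self.parent[x]
            return root

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            # Union by size: attach the smaller tree under the larger one.
            if self.size[ra] < self.size[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra
            self.size[ra] += self.size[rb]

    ds = DisjointSet(3)
    ds.union(0, 1)   # "a and b are in the same set"
    ds.union(1, 2)   # "b and c are in the same set"
    print(ds.find(0) == ds.find(2))  # True: a and c are in the same set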

My writeup of union find is in https://dercuano.github.io/notes/incremental-union-find.html. I did not in fact find a way to make it efficiently support incremental edge deletion, which is what I was looking for.
Oh, huh, this is serendipitously handy because I need to be doing something like that this weekend (translating from Go to Rust - I already have a reparenting solution using hashes but hopefully this is easier / better.)
I have heard stories about annotating the edges with elements from a group rather than using unlabelled edges. Have you heard anything about that?
Is there a name for the data structure/algorithm?
It's called union-find, but also disjoint-set sometimes. The process of reparenting nodes is called path compression, and the optimization of making the larger of the two sets the parent when merging is usually called "union by size" (or "union by rank" when it's based on tree depth instead). Any of those terms should give you more details upon search.
If at some point you have a cycle, you can then reparent and remove the cycle, right? This structure in general then can encode transitive relations effectively?

Something like “a and b …” “b and c …” “c and a …”.

> If at some point you have a cycle, you can then reparent and remove the cycle, right?

You'd never have a cycle. The way a cycle would theoretically arise would have to be joining something to its own child. But you don't naively join the two objects you're given - you find their root objects, and join those. If - as would be necessary for a cycle to form - they have the same root, you can return without doing anything.

If we call

    unite(a,b)
    unite(b,c)
    unite(c,a)
then we should end up with this structure:
        c
       / \
      b   a
With parents on the right, the structure will be a-b after the first call and a-b-c after the second call. The reason we end up with a shallower tree after the third call is that when a is passed to unite, we call find on it and it gets reparented directly to c. Then, c (the representative for c) and c (the representative for a) are already related, so we don't adjust the structure further.

> This structure in general then can encode transitive relations effectively?

Well... no. This structure embodies the concept of an equivalence relation. You wouldn't be able to use it to express a general transitive relation, only a transitive relation that is also symmetric and reflexive.

Wouldn't that be extremely useful and a polynomial solution for 3-SAT, which is NP complete?
Cache-Oblivious Data Structures: https://cs.au.dk/~gerth/MassiveData02/notes/demaine.pdf

A vaguely related notion is that naive analysis of big-O complexity in typical CS texts glosses over the increasing latency/cost of data access as the data size grows. This can't be ignored, no matter how much we would like to hand-wave it away, because physics gets in the way.

A way to think about it is that a CPU core is like a central point with "rings" of data arranged around it in a more-or-less a flat plane. The L1 cache is a tiny ring, then L2 is a bit further out physically and has a larger area, then L3 is even bigger and further away, etc... all the way out to permanent storage that's potentially across the building somewhere in a disk array.

In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.

Hence, a lot of algorithms that on paper have identical performance don't in reality, because one of the two may have an extra sqrt(n) factor in there.

This is why streaming and array-based data structures and algorithms tend to be faster than random-access, even if the latter is theoretically more efficient. So for example merge join is faster than nested loop join, even though they have the same performance in theory.

Real-Time Collision Detection[1] has a fantastic chapter on this with some really good practical examples, if I remember right.

Great book, I used to refer to it as 3D "data structures" book which is very much in theme with this thread.

[1] https://www.amazon.com/Real-Time-Collision-Detection-Interac...

The implicit grid data structure from there is a personal favorite of mine. I used it in a game once and it performed incredibly well for our use case.

It's a bit too complicated to totally summarize here, but it uses a bit per object in the scene. Then bit-wise operations are used to perform quick set operations on objects.

This data structure got me generally interested in algorithms that use bits for set operations. I found the Roaring Bitmap github page has a lot of useful information and references wrt this topic: https://github.com/RoaringBitmap/RoaringBitmap

I often recommend this book, RTCD, as a "better intro to data structures and algorithms" (sometimes as "the best"). The advice often falls on deaf ears, unfortunately.
I spent a long time looking at the Fast Multipole Method, an O(N) algorithm for long-range force calculations. In practice it couldn't compete against the O(N log N) FFT-based method we used: the coefficient was so large (cache locality, etc.) that you'd need to simulate systems way bigger than is feasible before it paid off.
Doing a decent job of analysis would include the sqrt(n) memory access factor. Asymptotic analysis ignores constant factors, not factors that depend on data size. It's not so much 'on paper' as 'I decided not to add a significant factor to my paper model'.

Cache oblivious algorithms attempt to make the memory access term insignificant so you could continue to ignore it. They are super neat.

> In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.

I was about to write a comment suggesting that if we made better use of three dimensional space in constructing our computers and data storage devices, we could get this extra latency factor down to the cube root of n.

But then, I decided to imagine an absurdly large computer. For that, one has to take into account a curious fact in black hole physics: the radius of the event horizon is directly proportional to the mass, rather than, as one might expect, the cube root of the mass [1]. In other words, as your storage medium gets really _really_ big, you also have to spread it out thinner and thinner to keep it from collapsing. No fixed density is safe. So, in fact, I think the extra factor for latency in galactic data sets is neither the square root nor cube root, but n itself!

[1] https://en.wikipedia.org/wiki/Schwarzschild_radius

The main difficulty with growing computing devices in three dimensions is cooling. All that heat has to be somehow carried away. It's already difficult with 2D devices. One would have to incorporate channels for cooling liquids into the design, which takes a bite out of the gains from 3D.

https://www.anandtech.com/show/12454/analyzing-threadripper-...

This series of blog posts explains this sqrt behaviour quite nicely, covering both theory (including black holes) and practice:

http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html

If it's stored far away enough, re-doing the calculation is faster. Depending on how much accuracy you need, you might as well load a 2022-07-22 07:26:34 earth and have that guy on HN evaluate his thoughts all over again.
Does anyone actually use cache-oblivious data structure in practice? Not, like, "yes I know there's a cache I will write a datastructure for that", that's common, but specifically cache-oblivious data structures?

People mention them a lot but I've never heard anyone say they actually used them.

To an extent, the B-Tree data structure (and its variants) is cache oblivious. They smoothly improve in performance with more and more cache memory, and can scale down to just megabytes of cache over terabytes of disk.

The issue is that the last tree level tends to break this model because with some caching models it is "all or nothing" and is the biggest chunk of the data by far.

There are workarounds which make it more truly cache oblivious. How often this is used in production software? Who knows...

Do we mean different things by "merge join" and "nested loop join" ? For me "merge join" is O(n) (but requires the data to be sorted by key) whereas "nested loop join" is O(n^2).
Wouldn't you at least be looking at n log(n) for the sort in the merge join?
Yes, if the data is not already sorted. Thus it's O(n) for already sorted data and O(n log(n) + n) -- which simplifies to O(n log(n)) -- for arbitrary data.
(Fantastic post idea OP. One of the best I've ever seen :D)

Related to bloom filters, xor filters are faster and more memory efficient, but immutable.

HyperLogLog is an efficient way to estimate cardinality.

Coolest thing I've learned recently was the Y-fast trie. If your keys are bounded integers drawn from a universe M (say, the set of all 128-bit numbers), you get membership, predecessor, or successor queries in O(log log M) time, not O(log n) like in a normal tree.

see: https://www.youtube.com/playlist?list=PLUl4u3cNGP61hsJNdULdu... (6.851 Advanced Data Structures, Erik Demaine)

Would love to learn more "stupidly fast at the cost of conceptual complexity" things.

edit:

(adding) I don't know a name for it, because it's not one thing but a combination, but one can:

Use the first N bits from a very quick hash of a key from an unknown distribution (say a file path, or a variable name, or an address, or a binary blob,..) as a way to "shard" this key across M other fast data structures (like a tree) for search/add/remove. By changing M, you can tune the size of the terms in the O(1)+O(log) equation for running time.

Trees getting too deep for fast search? Every increase of N moves the computation from the log search of the tree to the space tradeoff of having more trees.

Added benefit is you can scale to multiple threads easily. Instead of locking the whole tree, you lock a tiny sub-tree.

Very clever. (I first saw in the Aerospike key value store)
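
A rough sketch of the sharding idea in Python (the shard count, hash choice, and per-shard dict are all placeholder assumptions; each shard could just as well be a tree or skip list, and each shard gets its own lock):

    import hashlib
    import threading

    NUM_SHARDS = 256  # tune: more shards -> shallower per-shard structures

    class ShardedMap:
        def __init__(self):
            self.shards = [dict() for _ in range(NUM_SHARDS)]
            self.locks = [threading.Lock() for _ in range(NUM_SHARDS)]

        def _shard(self, key):
            # First byte of a fast hash picks the shard.
            digest = hashlib.blake2b(key.encode(), digest_size=1).digest()
            return digest[0] % NUM_SHARDS

        def put(self, key, value):
            i = self._shard(key)
            with self.locks[i]:          # lock one shard, not the whole map
                self.shards[i][key] = value

        def get(self, key):
            i = self._shard(key)
            with self.locks[i]:
                return self.shards[i].get(key)

    m = ShardedMap()
    m.put("/var/log/syslog", 42)
    print(m.get("/var/log/syslog"))  # 42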

Big fan of HLL

Apache foundation has a fantastic DataSketches library that includes HLL and many other powerful data analytics algorithms: https://datasketches.apache.org/

Lee Rhodes has done an excellent introduction to this library - explaining some of the use cases, advantages, and things to be aware of when using these techniques: https://www.youtube.com/watch?v=nO9pauS-mGQ

Y-fast tries are some of my favorites. I think they are heavily underutilized in modern terms. They sat by the wayside for a long time because datasets were relatively small, RAM was scarce when they were created, and bitwise operations were inefficient, along with many other constant factors.

Today, however, a lot of people have datasets on the order of 2^16 or 2^32 keys they need to maintain, and efficient bitwise operations (on up to 512 bits) are the norm. Y-fast tries are faster than B-trees and other data structures. Also, because they divide _space_ rather than the _data_, they are very amenable to multi-threaded and distributed algorithms. For example, the hash tables in a Y-fast trie can actually be a rendezvous hash pointing to a database node. Once in that node you can hash on cores again to get to a local process, for example.

A fantastic thing about HyperLogLog is that it can be merged, so you can split your data between multiple servers, precompute an HLL for all IPs every minute, and then ask "how many unique IPs were there yesterday?"

Discovered HLL because it's used in ClickHouse, which employs a ton of cool but obscure data structures.

Works well in analytics cubes since they can be combined.

You can retain them across time too, such that you can ask questions like "how many unique users were there over the last N days?" without needing the source data. Great for privacy-aware analytics solutions.

I think what you’re describing in the last example is the MinHash algorithm

edit: actually I’m not sure, sounds related though

I forgot about Aerospike. They basically built a NAND-optimized key-value store, right? I remember reading about how they used the FTL and thinking they were pretty clever. I can't for the life of me find the article now. I think they were really big in the ad tech space? Is that still the case?
AIUI radix sort is "kinda-ish" related, yeah :) They both involve looking at your data at a 90-degree angle and slicing on bitplanes.
Splay trees:

These are binary search trees, but when you 'splay' the tree (such as by search) it rebalances the tree so that the search result is on top. This means while it is O(log n) for operations like other self-balancing trees, it optimizes tree depth in the face of non-random access so that recently accessed items are fewer steps into the tree.

Piece tables:

A bit more common for text editors, where you need to represent a sequence of characters in a form that allows efficient memory use and fast insertions/removals (so editing the first line isn't moving the 10MB of data that follows it). You create a series of references to spans (or pieces) of the document, possibly starting with a single span pointing to a mmap() version. Edits are done by appending/prepending pieces, which are potentially just references to subsequences of items created in fresh memory, appended into a buffer. Saving can create a single sequence (and a single span).

This has interesting variations:

- Put attributes on the pieces for formatting, such as indicating text should be rendered a different color or bolded.

- Create a hierarchy of pieces-of-pieces. With formatting attributes you are then dangerously close to a DOM.

- Retain old copies of a piece table - since your original mmap() file hasn't changed and your changes are in an append-only buffer, those piece table copies provide undo/redo state.
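
A minimal piece-table sketch in Python (ignoring the formatting/undo variations above, and using a plain string where a real editor would mmap the file):

    class PieceTable:
        def __init__(self, original):
            self.original = original     # read-only buffer (the original file)
            self.add = ""                # append-only buffer for edits
            # Each piece: (buffer_name, start, length)
            self.pieces = [("orig", 0, len(original))]

        def text(self):
            bufs = {"orig": self.original, "add": self.add}
            return "".join(bufs[b][s:s + n] for b, s, n in self.pieces)

        def insert(self, pos, s):
            new_piece = ("add", len(self.add), len(s))
            self.add += s
            offset = 0
            for i, (b, start, n) in enumerate(self.pieces):
                if offset <= pos <= offset + n:
                    cut = pos - offset
                    left = (b, start, cut)
                    right = (b, start + cut, n - cut)
                    # Split the piece; drop any zero-length fragments.
                    self.pieces[i:i + 1] = [p for p in (left, new_piece, right)
                                            if p[2] > 0]
                    return
                offset += n
            self.pieces.append(new_piece)

    doc = PieceTable("hello world")
    doc.insert(5, ",")
    print(doc.text())  # hello, world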

Here's another one I was thinking about. It's an idea for a hierarchical Free Space Bitmap. I don't know if anything like this has appeared in a real file system or not.

At the lowest level, let's have one bit represent if a 4K chunk is free space or occupied. 64 bits at this level tells you the usage of 64 * 4096 bytes (256KB)

Then Level 2. Two arrays. One bit at this level represents 64 bits in the level 1 array. One array for All-Bits-Set in level 1, one array for Any-Bit-Set in level 1. 64 bits at Level 2 tells you usage of 64 * 64 * 4096 bytes (16MB).

Then Level 3. Two arrays. One bit at this level represents 64 bits in the level 2 array. One array for All-Bits-Set in level 2, one array for Any-Bit-Set in level 2. 64 bits at Level 3 tells you the usage of 1GB.

I'd imagine there would be ways to use the hierarchical tables to quickly find a hole of a particular size without checking only the lowest level.
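
A rough two-level Python sketch of the idea (here the upper level is just the "all-bits-set" array, so a completely full 16MB region can be skipped in a single check; block counts are assumed to be multiples of 64 for brevity):

    WORD = 64
    FULL = (1 << WORD) - 1

    class FreeSpaceMap:
        # Level 1: one bit per 4K block (1 = used).
        # Level 2: one bit per level-1 word, set when that word is completely full.
        def __init__(self, num_blocks):
            self.l1 = [0] * (num_blocks // WORD)
            self.l2_full = [0] * (len(self.l1) // WORD + 1)

        def mark_used(self, block):
            w = block // WORD
            self.l1[w] |= 1 << (block % WORD)
            if self.l1[w] == FULL:
                self.l2_full[w // WORD] |= 1 << (w % WORD)

        def first_free(self):
            for i, l2 in enumerate(self.l2_full):
                if l2 == FULL:
                    continue              # 64 full level-1 words: skip 16MB at once
                for j in range(WORD):
                    w = i * WORD + j
                    if w < len(self.l1) and self.l1[w] != FULL:
                        word = self.l1[w]
                        for b in range(WORD):
                            if not (word >> b) & 1:
                                return w * WORD + b
            return None

    fsm = FreeSpaceMap(1 << 18)   # 1 GB worth of 4K blocks
    for b in range(130):
        fsm.mark_used(b)
    print(fsm.first_free())       # 130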

Skip lists! Super wacky, super fun to implement. O(log n) insertion/deletion/search, maintained in sorted order. It's probabilistic; you keep a number of linked-list levels, and to insert a node you flip coins to pick a level k, then insert it into the lowest k linked lists.

also, radix heaps are underappreciated.
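
A compact (and not particularly optimized) Python sketch of a skip list:

    import random

    class SkipNode:
        def __init__(self, value, level):
            self.value = value
            self.next = [None] * level

    class SkipList:
        MAX_LEVEL = 16

        def __init__(self):
            self.head = SkipNode(None, self.MAX_LEVEL)

        def _random_level(self):
            level = 1
            while level < self.MAX_LEVEL and random.random() < 0.5:
                level += 1            # coin flips decide how many lists to join
            return level

        def insert(self, value):
            update = [self.head] * self.MAX_LEVEL
            node = self.head
            for lvl in range(self.MAX_LEVEL - 1, -1, -1):
                while node.next[lvl] and node.next[lvl].value < value:
                    node = node.next[lvl]
                update[lvl] = node    # last node before the insertion point
            new = SkipNode(value, self._random_level())
            for lvl in range(len(new.next)):
                new.next[lvl] = update[lvl].next[lvl]
                update[lvl].next[lvl] = new

        def contains(self, value):
            node = self.head
            for lvl in range(self.MAX_LEVEL - 1, -1, -1):
                while node.next[lvl] and node.next[lvl].value < value:
                    node = node.next[lvl]
            node = node.next[0]
            return node is not None and node.value == value

    sl = SkipList()
    for x in [5, 1, 9, 3]:
        sl.insert(x)
    print(sl.contains(3), sl.contains(4))  # True False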

  "This is the story of a clever trick that's been around for at least 35 years,     in which array values can be left uninitialized and then read during normal   operations, yet the code behaves correctly no matter what garbage is sitting in the array. Like the best programming tricks, this one is the right tool for the job in certain situations. The sleaziness of uninitialized data access is offset by performance improvements: some important operations change from linear to constant time."
https://research.swtch.com/sparse
I remember this from Programming Pearls and I was going to post about it after reading the comments. Great stuff.
Spatial hashing.

Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.

Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
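
A minimal Python sketch (the cell size is an arbitrary assumption you'd tune to your data):

    from collections import defaultdict

    CELL = 10.0  # grid cell size

    def cell_of(x, y):
        return (int(x // CELL), int(y // CELL))

    grid = defaultdict(list)

    def insert(obj, x, y):
        grid[cell_of(x, y)].append(obj)

    def nearby(x, y):
        # Objects in the 3x3 block of cells around (x, y); candidates for
        # an exact distance check afterwards.
        cx, cy = cell_of(x, y)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                yield from grid.get((cx + dx, cy + dy), [])

    insert("ship", 12.0, 3.5)
    print(list(nearby(14.0, 5.0)))  # ['ship']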

TBH even quadkeys are a fun answer to OP's question; many people aren't aware of them.

Simple explanation:

If you have data with x y coordinates and you know the bounds.

To compute the quad key for a point:

1. The key starts as the empty string. (I've also seen it start with "Z" to handle points outside the bounds)

2. Divide the space into 4 quadrants

3. determine which quadrant the point falls in, append a letter (A-D depending on the quadrant) to the key

4. Repeat step 3 using the quadrant bounds (i.e. recursively smaller area) until you have desired accuracy

This can then be used to efficiently find points within rectangular bounds.
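
A rough Python sketch of the steps above (the A-D letter convention and the example bounds are just illustrative choices):

    def quadkey(x, y, bounds, depth=8):
        # bounds = (min_x, min_y, max_x, max_y); depth controls accuracy.
        min_x, min_y, max_x, max_y = bounds
        key = ""
        for _ in range(depth):
            mid_x = (min_x + max_x) / 2
            mid_y = (min_y + max_y) / 2
            if y >= mid_y:              # top half
                min_y = mid_y
                if x < mid_x:
                    key += "A"          # top-left
                    max_x = mid_x
                else:
                    key += "B"          # top-right
                    min_x = mid_x
            else:                       # bottom half
                max_y = mid_y
                if x < mid_x:
                    key += "C"          # bottom-left
                    max_x = mid_x
                else:
                    key += "D"          # bottom-right
                    min_x = mid_x
        return key

    print(quadkey(0.7, 0.7, (0, 0, 1, 1), depth=3))  # "BCB"

Points that share a key prefix fall in the same quadrant at that depth, which is what makes rectangular range queries cheap.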

That's how mongodb geospatial indexes work IIRC
Was about to mention this. If I recall correctly, the 2d-sphere index rounds geospatial coordinates to 5 decimals. Very occasionally, I found it would distort polygon geometries just enough to cause them to become invalid (e.g. overlapping geometries), which causes the index build to fail.

In my recent experience working with collections containing millions of documents, each containing a GeoJSON-style polygon/multipolygon representing a property (i.e. a block of land), I found invalid geometries occurred for about 1 document in 1 million. For a while I suspected the data vendor was the cause; however, it became more puzzling when other geospatial software confirmed the geometries were valid. Eventually we traced the issue to the 2d-sphere index.

A very clever workaround was suggested by a colleague of mine, inspired by [1]. It preserved the original geometries. In each document, we added a new field containing the geometry's extent. A 2d-sphere index was then built on the extent field instead of the original geometry field. Invalid geometries were no longer an issue since we were dealing with much simpler geometries that were substantially larger than the max precision of the index.

When running geoIntersects queries on our collection of millions of documents, we did so in 2 steps (aggregation queries):

1. GeoIntersects on the extent field (uses the index).

2. On the result set from the last step, perform geoIntersects on the original geometry field (operates on a much smaller set of records compared to querying the collection directly)

[1] https://www.mongodb.com/docs/manual/tutorial/create-queries-...

It's tried and tested for sure, I first encountered them in 94 but I assume they're much older.

A little sleuthing shows that (TIL) they are an application of Z-order curves, which date back to 1906

From a basic binary search, this is very intuitive. Your explanation was well written.
A variation of this are dynamic spatially-hashed voxel grids, where the outer grid is spatially hashed and the grid elements are stored (as needed) as dense voxel grids. The advantages of this are that you get voxel grid-like behavior over essentially unbounded size, so long as the data is sparse (otherwise you would need to allocate a significant number of grid elements as voxel grids and the memory savings go away).

These data structures have been used in 3D reconstruction methods like CHISEL where they can often outperform octrees due to better memory access behavior.

Ever heard of geospatial hierarchical Indices?

E.g. Uber's H3 index or Google's S2 index.

Yeah, just reinvented it a few weeks ago and was like “whoa, that’s it?” It worked perfectly for my fast clustering use case.
> reasonably well behaved

Curious what this means in this context. No hot spots? Not sparse?

Performance degrades when you get many hash collisions, which will happen if you have a lot of points clumped together in space.

Sparsity is fine.

For points you expect to be unevenly distributed the more advanced spatial partitioning data structures (kd tree, BVH) perform better.

For routing you can also use a sparse radix tree, also known as a Patricia tree. This allows fast matching against a set of prefixes (which is what IP routing needs) without taking up a lot of space.
Here is one not on the list so far:

Set Sketches. They allow you to compute the difference between two sets (for example, to see if data has been replicated between two nodes) WITHOUT transmitting all the keys in one set to the other.

Say you have two sets of the numbers [1, ..., 1million] all 32 bit integers, and you know one set is missing 2 random numbers. Set sketches allow you to send a "set checksum" that is only 64 BITS which allows the other party to compute those missing numbers. A naive algorithm would require 1MB of data be transferred to calculate the same thing.

*(in particular pin sketch https://github.com/sipa/minisketch).

The Zipper acts like a linked list with a cursor, or "focused element"; it's implemented as a pair of lists in opposite orders or, equivalently but more symmetrically, as a triple of (List[A], A, List[A]).

Say we have a zipper containing [0, 1, 2, 3, 4, 5], and we're focusing on the 3. In code this will look like:

    ([2, 1, 0], 3, [4, 5])
Where [a, b, c] denotes a singly-linked list, with O(1) head (returning a) and tail (returning [b, c]). Notice that the first list is in reverse order.

To focus on the next element, we put the currently focused element on the first list, and pull the head off the second list:

    ([3, 2, 1, 0], 4, [5])
And vice versa to focus on the previous element:
    ([1, 0], 2, [3, 4, 5])
I like to imagine this as a string of beads, with the focused element held in our fingers and the rest hanging down:
     3
    / \
    2 4
    | |
    1 5
    |
    0
To move the focus forwards and backwards, we move our grip to the next/previous bead.

This works nicely as an immutable data structure, since the tails of both lists can be shared (i.e. we don't need to copy the whole list, we just wrap/unwrap an element onto the head).

Zippers were first applied to lists, but have since been generalised:

- To trees: focusing on one node, and being able to move the focus up, to the left-child, or to the right child

- To having more than one focus

- To datastructures more generally, by treating them as 'derivatives' (as in calculus)

https://en.wikipedia.org/wiki/Zipper_(data_structure)

As an example, the XMonad window manager uses a zipper to keep track of the currently-focused window.

I've also used Zippers for window management, e.g. having a list of Displays, with one of them focused (accepting keyboard input); each Display containing a list of Workspaces, with one of them focused (currently shown); and each Workspace containing a list of Windows, with one of them focused (the active window). Keyboard shortcuts could shift between the previous/next Window, Workspace, or Display.

I encountered this when I tried implementing a Brainfuck interpreter in Haskell. Representing the memory array using a List + index would be way too slow; but a Zipper is perfect!
I don’t understand why you wouldn’t just use a list and an index. You can always access list[index+1] or list[index-1] or list[0] or list[list.length-1].

What is the benefit here?

The benefit is the data sharing between the tails. You can very cheaply get a copy of zipper with the data at (or around) the focused element changed while also keeping the original data unchanged. Admittedly, this is something people in functional programming languages probably care much more about.
Firstly, accessing an arbitrary list index is O(N), whilst a Zipper can access its focus in O(1) (shifting the focus is O(1) for both Zippers and list+index)

Secondly, using an index forces us to do bounds checks, keep track of the length (or re-calculate it at O(N) cost), etc. whereas a Zipper is "correct by construction"; i.e. every value of type (List[A], A, List[A]) makes sense as a zipper; whereas many values of type (List[A], Nat) don't make sense as zippers; e.g. ([], 42) doesn't make sense. Our logic also becomes easy for pattern-matching, e.g. in Scala:

    myZipper match {
      case Zipper(xs, x, y::ys) => Zipper(x::xs, y, ys)
      case z => z
    }
Also, your example of 'list[index-1]' is more complicated than it first appears. In particular, what is the type of 'index'? If it's an integer, then at least half of the possible values are meaningless (negative numbers); if it's a natural (AKA unsigned integer), then it's not closed under subtraction (i.e. we have to worry about underflow)!

Thirdly, Zippers can be unbounded in both directions (i.e. the lists can be "infinite"/co-inductive). For example, they're great for implementing Turing machines, starting with an infinite blank tape in both directions. https://en.wikipedia.org/wiki/Coinduction

Fourthly, Zippers have more sharing. Here's an arbitrary example: say we have a zipper like this:

    ([c, b, a, ...], elem, [x, y, z, ...])
If we shift right, replace the focused element with 'newElem', then shift right again, we'll get this zipper:
    ([newElem, elem, c, b, a, ...], y, [z, ...])
Those operations take O(1) time and O(1) memory (and produce O(1) garbage, if we're no longer using the old zipper). In particular, since the tails of the lists ([c, b, a, ...] and [z, ...]) are unchanged, the same pointers will appear in both the old and new zippers; there has been no copying. This is important, since those lists could be arbitrarily long (or infinite!).

In contrast, if we did these operations on a list+index, everything before the replaced element would need to be reallocated; i.e. [..., a, b, c, elem, _] would need to be copied, since _ has changed from 'y, z, ...' to 'newElem, z, ...'. That requires O(N) memory allocation, takes O(N) time to do the copying (and produces O(N) garbage, if we don't want the old zipper).

Finger trees allow you to do amazing things. In essence they let you build an index, or multiple indices, for your dataset and then store them in a single structure.

When I go from Haskell back to imperative land I find myself greatly missing this ability. Sure I can make multiple hashmaps or trees or whatever but being able to stuff it all in one data structure is amazing.

One structure I built with them that is much more difficult with typical structures is a "block list" structure.

In this structure each block has a particular width and they're all stacked side by side.

I want to perform a query, "which box is at position X". So if my boxes are of widths 7,20,10, then the lookup for 2 yields the first box, the lookup for 12 yields the second, etc.

More interestingly, if add a new box between the second and third, the indices covered by the last box is increased.

With finger trees you use a sum monoid. This is easy. In other languages you have to roll your own structure, either using a list (with O(n) lookup) or a tree with O(log n) lookup but O(n) inserts (to translate the indices of future blocks).

https://andrew.gibiansky.com/blog/haskell/finger-trees/

You might enjoy a paper I’m a coauthor on which combines finger trees with B-trees to build a data structure for optimal updates to sliding windows: Optimal and General Out-of-Order Sliding-Window Aggregation, https://www.scott-a-s.com/files/vldb2019_fiba.pdf

Slides from the conference talk: https://www.scott-a-s.com/files/vldb2019_fiba_slides.pdf

Very interested in your idea. How do concurrent updates work with the augmented trees? I have been thinking about the same for a while: something like CouchDB, but augmented with segment aggregations (monoids) so we can do interval queries over time too. The only thing bothering me is that augmented trees are notorious for contention on concurrent updates. Do you have any ideas on merging back if we have multiple versions of augmented trees?
You can do this with any tree, finger or otherwise, and the big-O stuff applies as long as the tree is within a constant factor of balanced. For some bizarre reason, though, the concept of caching a monoid in each node doesn’t seem to have spread to the standard libraries of most languages. It’s a pity, since that would be incredibly useful.
If I'm not mistaken, Clojure's data structures are (or used to be) implemented using finger trees.
Conflict-free replicated data type (CRDT) is super interesting data type (and algorithm), especially when it is implemented for complex data structures like JSON: A JSON CRDT is "[...] an algorithm and formal semantics for a JSON data structure that automatically resolves concurrent modifications such that no updates are lost, and such that all replicas converge towards the same state (a conflict-free replicated datatype or CRDT)." [1].

[1] A Conflict-Free Replicated JSON Datatype (https://arxiv.org/abs/1608.03960) by Martin Kleppmann and Alastair R. Beresford.

My contribution: the Binary Numeral Tree: https://eprint.iacr.org/2021/038

This is a tree-like structure where the structure of the tree is fully determined by the number of leaves N. Specifically, each 1-bit in the binary representation of N corresponds to one of the tree's perfect binary subtrees. For that reason, I think of it more as "structure imposed upon a list" rather than a typical mutable tree with pointers and such.

What are the applications of this data structure? I am only aware of one: computing the Merkle root of a large amount of data in streaming fashion. The funny thing is that it's been discovered independently multiple times: in BLAKE3, Sia, Utreexo, and more. Maybe someday, someone will find another use for it. :)

(EDIT: using "mail" because I was using "letter" too much in my description)

The way I think of a bloom filter is like an office mailroom or a wall of post office mailboxes where mail is grouped by the first letter of people's names.

Given someone's name, you can glance and see "Uptrenda? No letters for 'U'" very quickly. This is when a bloom filter returns FALSE. But if they glance and see mail in the "U" box, then they return MAYBE. After a MAYBE, to see if you actually have any mail, they need to actually go through the box (i.e. the full lookup you fall back to when a bloom filter says MAYBE).

Mine are a couple that are present in the Java Collections framework, but I'm always disappointed that other languages' standard libraries don't include them:

- TreeMap/Set: A RedBlack tree. It keeps keys ordered, either naturally or by using the provided Comparator, but has log(n) time on basic ops.

    - https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html
- LinkedHashMap/Set: keeps keys ordered by insertion order via a linked list but can be configured to use access order, making it easy to use as an LRU cache. It keeps O(1) time but is less efficient than HashMap:
    - https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html
> other language's don't include them

std::map and std::set in C++ are actually a tree map and tree set.

1. Merkle-trees. Helps with content-based-retrieval, efficient add/modify/delete/move node operations. Useful in scenarios where you are building a data-serialization format that needs to support efficient updates.

2. Not necessarily a data structure, but SMT solvers (like Z3) provide a very useful abstraction for certain kinds of problems. They also use some cool data structures like disjoint-sets among other things.

3. Lock-free datastructures

OBDD: Ordered Binary Decision Diagrams. This data structure allows efficient symbolic operations on boolean functions. Widely used in electronic design automation software.

[1] https://en.wikipedia.org/wiki/Binary_decision_diagram [2] https://apps.dtic.mil/sti/pdfs/ADA470446.pdf

Fibonacci Heap

Famous as the data structure that achieved optimal asymptotic runtime in Dijkstra’s algorithm. In practice it isn’t often used because simpler heaps outperform it except in very specific circumstances.

https://en.wikipedia.org/wiki/Fibonacci_heap

Soft heap

As far as I know this hasn’t really been implemented. It’s used for the best known minimum spanning tree algorithm. Pretty crazy. https://en.wikipedia.org/wiki/Soft_heap

Tries (or prefix trees).

We use them a lot at Pyroscope for compressing strings that have common prefixes. They are also used in databases (e.g. indexes in Mongo) and file formats (e.g. debug symbols in the macOS/iOS Mach-O format are compressed using tries).

We have an article with some animations that illustrate the concept in case anyone's interested [0].

[0] https://github.com/pyroscope-io/pyroscope/blob/main/docs/sto...
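
A bare-bones Python trie sketch, where shared prefixes are stored only once:

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_end = False

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_end = True

        def starts_with(self, prefix):
            # All stored words sharing this prefix.
            node = self.root
            for ch in prefix:
                if ch not in node.children:
                    return []
                node = node.children[ch]
            out = []
            def walk(n, acc):
                if n.is_end:
                    out.append(prefix + acc)
                for c, child in n.children.items():
                    walk(child, acc + c)
            walk(node, "")
            return out

    t = Trie()
    for name in ["main.go", "main.rs", "mainframe"]:
        t.insert(name)
    print(t.starts_with("main."))  # ['main.go', 'main.rs'] (order may vary)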

Even though a trie is pretty standard and expected (to be known) these days, I will vote for this. It was my first deep dive into more exotic data structures after an interview question about implementing T9 that stumped me many years ago.

https://github.com/Azeem112/tries-T9-Prediction

I wouldn't consider tries to be obscure tbh. They are the recommended solution for many leetcode-style interview problems involving string searching. I think anyone who has done serious interview prep has encountered them.

https://leetcode.com/discuss/general-discussion/931977/begin...

I got stung by this for a test engineering role. Apparently implementing them from scratch is a pre-req to building out test infrastructure!
I'm not sure what the point of your post is. While bloom filters are heavily used in actual production code throughout the industry, it is very rare for anyone to need to code their own, or make changes to a prior legacy implementation.

Not all educational programs will cover bloom filters, and for those that do, there's no guarantee that the students will retain the information, and be able to recall it.

I don't know of any common interview question patterns where a bloom filter would be the only optimal solution. If it does share optimality with any other data structures, you would be better off choosing something else unless you are extremely confident in your ability to implement one and explain it clearly and in-depth, since the odds of your interviewer not being familiar with them are high in comparison to other solutions.

I only know what a bloom filter is because of HN, and previous submissions like this one that have made it to the front page throughout the years. It's nowhere near as common knowledge as tries, especially if you are sampling younger cohorts.

I was with you until the last seven words ;)

Trees were a huge part of CS practice and education historically, but have been replaced by hash-based methods in many cases. For example, in C++ std::map is generally a tree, while in a more recent language the standard Map data structure will be a hashmap. My impression is that the instruction time devoted to Bloom filters (and other hash-based methods) vs trees has shifted towards the former over the last ~20y.

C++ STL uses trees because the generic requirement for them to work is a < operator. Designing generics around hashing is much more difficult, and most languages struggle with that.

Ironically, due to caches, sorting and then using algorithms that rely on order tends to be superior to most hashing implementations, even though it is theoretically worse due to the log(n) factor. So in a way, C++ algorithms are actually more modern.

To your point, C++ got hash structures in the standard library (unordered_map and unordered_set) in C++11 (so only about a decade ago).
A lot of this is that, other than heaps, most tree-based algorithms involve O(log n) random-access pointer lookups, which makes them relatively slow as in-memory data structures.
Do you have any documentation about how tries are used in mongo indexes? Or could you point to the source?
There's not much about it online. They do mention it in the docs, they call this "prefix compression" [0], so you can search for that. This article describes it [1], although it is somewhat high level.

They only use this for indexes. It works really well for internal mongo ids (they all start with timestamps if I remember correctly). It also works well for compound indexes. Pro tip: for compound indexes always go from low-cardinality attribute to high-cardinality attribute to get the best results from prefix compression (it's also good for speed of queries).

[0] https://www.mongodb.com/docs/manual/reference/glossary/#std-...

[1] https://medium.com/swlh/mongodb-indexes-deep-dive-understand...

HNSW, or Hierarchical Navigable Small World is a graph data structure for approximate nearest neighbor search of vectors.

https://arxiv.org/abs/1603.09320

The problem space of ANN is one of those really deep holes you can go down. It’s a game of balancing time and space, and it’s got plenty of fascinating algorithms and datastructures.

Check out http://ann-benchmarks.com/ for a comparison. HNSW is not “the best” but it’s easy to understand and is quite effective.

Nested Sets

https://en.wikipedia.org/wiki/Nested_set_model

You can use it to store tree structures in relational databases for faster querying of things like "all nodes below node x", "how many nodes to the top node", and so on.

Querying is easy, but comes at the cost that changes like inserting and deleting are more expensive.

I used this for an auth service with roles, users and org units. Pretty query-heavy; it worked very well.

Lazy Adaptive Trees : https://dl.acm.org/doi/10.14778/1687627.1687669

It is a B-Tree where updates to the tree are buffered in the branch nodes and are propagated down to the leaf nodes at a later point in time.

> ...bloom filters... Good use-case: routing. Say you have a list of 1 million IPs that are black listed...

Seems-needed clarification: Bloom filters suffer from false positives, but not false negatives. So - if BloomFilterBlacklistCheck( $IP ) returns false, your router can safely conclude "not blacklisted". But if it returns true, then you'll need to perform further checks. (And you can't just use a second Bloom filter of known false positives - that one will also suffer from false positives, potentially letting in traffic from blacklisted IPs.)

Bitboards - using bit arrays instead of matrices in board games like Tetris or chess: "A bitboard is a specialized bit array data structure commonly used in computer systems that play board games, where each bit corresponds to a game board space or piece. This allows parallel bitwise operations to set or query the game state, or determine moves or plays in the game."
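
A tiny Python sketch of the idea for an 8x8 board (Python ints stand in for 64-bit words, so we mask to 64 bits by hand):

    # Bit (rank * 8 + file) is set when that square is occupied.
    def square(rank, file):
        return 1 << (rank * 8 + file)

    white_pawns = 0
    for f in range(8):
        white_pawns |= square(1, f)               # all pawns on rank 2

    MASK64 = (1 << 64) - 1
    single_pushes = (white_pawns << 8) & MASK64   # every pawn advanced one rank, in one shift
    print(bin(single_pushes).count("1"))          # 8
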
Huet's zipper. https://en.wikipedia.org/wiki/Zipper_(data_structure).

One zipper value represents a regular tree of data, but from the perspective of the "current position". There are traversal operations that take a zipper value and return the same tree from the perspective of an element above/beside/below the current position, and there are operations that return the same structure but with the current element changed, or items deleted or inserted.

Huet's paper is an easy read if you've been exposed to OCaml or similar languages: https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced... . His "returned glove" metaphor is what made it click for me.

Clojure includes an implementation of Huet's zippers https://github.com/clojure/clojure/blob/master/src/clj/cloju... that is <300 lines of code. It's very, very clever and broadly useful for dealing with XML or other nested data structures.

Some ones I've used recently:

The "golden section search" to find a the minimum (or maximum) of a unimodal function. An actual real-world use case for the golden ratio.

Exponentially Weighted Moving Average filters. Or how to have a moving average without saving any data points..

Some of my classic favorites:

Skip lists: they give you what a sorted tree gives you, but the algorithms are low-complexity, which is nice.

Boyer-Moore string search is awesome..

Bit twiddling algorithms from Hackers Delight: for example (x &-x) isolates the least significant set bit: basically use the ALU's carry chain to determine priority.

Another is compress from Hacker's Delight. It very quickly compresses selected bits (from a bitmask), all to the right of the word. It's useful to make certain hash functions.

The humble circular queue. If your circular queue is a power of 2 in size, then the number of bits needed for indexes is at least log2(size) + 1. The + 1 is key- it allows you to distinguish between completely full and completely empty with no effort. So:

    Empty: (write_index == read_index)
    Full: (write_index == read_index + size)
    Count: (write_index - read_index)
    Insert: queue[write_index++ & (size - 1)] = data
    Remove: data = queue[read_index++ & (size - 1)];
All lossless data compression algorithms are amazing.
Regarding the bit “compress” operation, this is supported in hardware (PEXT instruction) by Haswell and newer processors, though unfortunately most AMD processors have poor implementations. I’ve found it to be quite handy when implementing subsetting operations.
Fenwick Trees (which, despite the name, are implemented using an array) allow computing prefix sums AND updating elements in O(log n) time. Very useful when n is on the order of millions. I have used them a few times in Project Euler problems.

https://en.wikipedia.org/wiki/Fenwick_tree
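
A minimal Python sketch (1-indexed, as is conventional for Fenwick trees):

    class Fenwick:
        def __init__(self, n):
            self.tree = [0] * (n + 1)

        def update(self, i, delta):
            # Add delta to element i (1-indexed).
            while i < len(self.tree):
                self.tree[i] += delta
                i += i & (-i)

        def prefix_sum(self, i):
            # Sum of elements 1..i.
            s = 0
            while i > 0:
                s += self.tree[i]
                i -= i & (-i)
            return s

    f = Fenwick(10)
    f.update(3, 5)
    f.update(7, 2)
    print(f.prefix_sum(6))   # 5
    print(f.prefix_sum(10))  # 7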

Is it possible to implement a Fenwick tree using a tree, and support quickly adding and removing items as well as quickly shifting the positions of suffixes?
Yes! This is both possible and fairly easy.
Disjoint-Sets have a very cool implementation whose amortized time complexity is extremely slow growing. It is not quite constant, but even for a disjoint-set with as many elements as there are particles in the universe, the amortized cost of an operation will be less than or equal to 4.

https://en.wikipedia.org/wiki/Disjoint-set_data_structure

aka "union find" — some other comments in the thread call it that
Use the idea[1] at work to make large NP-complete problems more tractable by breaking the problem up into disjoint subsets.

[1] It doesn't implement the inverse-Ackermann optimizations, but it is still implemented as union/find.

I've been told to consider it constant for all practical purposes
The inverse Ackermann function is one of the slowest-growing functions that you’re ever likely to encounter in the wild. To say that it’s “constant for all practical purposes” is 100% true but doesn’t do justice to how amazingly slow this function is to grow.

https://en.m.wikipedia.org/wiki/Ackermann_function

Tree automata and the BURS (Bottom Up Rewrite System) algorithm are pretty cool.
The Aho-Corasick automaton [0]. You have a bunch of strings, and you want to know if and where they appear in a longer string. You build a search trie from the strings, but add links back from each node to the node representing the longest suffix of the string it represents that is also in the trie, so you can skip backtracking as you scan, yielding linear time complexity in O(length of string being searched + number of results found).

[0] https://en.wikipedia.org/wiki/Aho–Corasick_algorithm

The aho-corasick Rust crate[1] also provides a way to build the automaton in a slightly different way to provide leftmost first or "preference first" match semantics. This matches how popular backtracking regex engines implement alternations of literals, for example. (It also provides leftmost longest semantics, which matches how POSIX regexes implement alternations.)

[1]: https://docs.rs/aho-corasick

Treap: https://en.wikipedia.org/wiki/Treap

It's like a Red-Black tree in use case, but much faster to implement, which is good for competitive programming. The average case complexity is the same for all operations, but there's an unlikely degeneration to worst-case linked-list behaviour.

Lazy Propagation Segment Tree: https://cp-algorithms.com/data_structures/segment_tree.html

Like a segment tree in that it supports range queries in logarithmic time, but supports range updates in log time as well.

I know a few more fun ones that I occasionally use in contests, but I've never touched otherwise.

Speaking of ease of implementation, I just discovered AA trees. They are probably not "obscure" but I think they are worthy of more fame because they perform just as well as red-black trees and are easier to implement. Finding clear, comprehensive documentation about them was not easy though, so here, for you, is the best I could find: https://www.cs.umd.edu/class/fall2019/cmsc420-0201/Lects/lec...

And the (unhelpful to me) original paper: https://user.it.uu.se/~arnea/ps/simp.pdf

Trivial, but very useful - phase accumulator / numerically controlled oscillator.

With two unsigned integers, freq and phase, you get amazingly fine control over timing. You only use the top N bits of phase for output. It's saved my bacon in embedded systems more than once.
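
A rough Python sketch of a numerically controlled oscillator (the bit widths, sample rate, and sine table are illustrative choices):

    import math

    PHASE_BITS = 32
    TABLE_BITS = 8
    # 256-entry signed 8-bit sine lookup table.
    TABLE = [int(127 * math.sin(2 * math.pi * i / (1 << TABLE_BITS)))
             for i in range(1 << TABLE_BITS)]

    def tuning_word(freq_hz, sample_rate_hz):
        # How much to add to the phase accumulator per sample.
        return int(freq_hz * (1 << PHASE_BITS) / sample_rate_hz)

    phase = 0
    step = tuning_word(440.0, 48_000)   # 440 Hz tone at 48 kHz
    samples = []
    for _ in range(16):
        samples.append(TABLE[phase >> (PHASE_BITS - TABLE_BITS)])  # top N bits only
        phase = (phase + step) & ((1 << PHASE_BITS) - 1)           # wrap like a uint32
    print(samples)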

Not super obscure, but I remember that one specific time when a circular linked list made a lot of sense to use - well, I wanted to use it, so I used it.

I had a bunch of API keys with poor expiry documentation and implementation, so to find out whether a key had expired it had to be used. I put them in a main "keys.pop loop" and all the methods below tried to use the current key. If the HTTP response was some obscure status code like 505, I simply called `continue;` in the loop to jump to the next key, without caring at all where I was before.

https://github.com/atedja/gring

Count-min sketch comes to mind [1]. It's a probabilistic data structure similar to the bloom filter. Learnt about it in this system design video "Top K Problem" [2].

[1] https://en.wikipedia.org/wiki/Count–min_sketch [2] https://www.youtube.com/watch?v=kx-XDoPjoHw
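
A rough Python sketch (the width/depth are arbitrary here, and Python's built-in hash stands in for proper pairwise-independent hash functions):

    import random

    class CountMinSketch:
        def __init__(self, width=2048, depth=4, seed=0):
            rng = random.Random(seed)
            self.width = width
            self.seeds = [rng.getrandbits(64) for _ in range(depth)]
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, seed, item):
            return hash((seed, item)) % self.width

        def add(self, item, count=1):
            for row, seed in zip(self.table, self.seeds):
                row[self._index(seed, item)] += count

        def estimate(self, item):
            # Never under-counts; may over-count because of collisions.
            return min(row[self._index(seed, item)]
                       for row, seed in zip(self.table, self.seeds))

    cms = CountMinSketch()
    for word in ["a", "b", "a", "c", "a"]:
        cms.add(word)
    print(cms.estimate("a"))  # >= 3, usually exactly 3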

This one is my favourite concept; bloom filters come right after it. The cool thing about Geohash is that you can extend its concept into 3 or more dimensions, and the whole idea just makes you think a bit differently about coding and operating over small tokens of data. Fantastic stuff.
SeqHash

This data structure is an immutable, uniquely represented Sequence ADS (Authenticated Data Structure) with various interesting use cases.

It is based on a novel hashing scheme that was first introduced with SeqHash.

See http://www.bu.edu/hic/files/2015/01/versum-ccs14.pdf by Jelle van den Hooff, a brilliant young researcher back then.

SeqHashes are uniquely represented Merkle Trees that also represent a sequence. Concatenation of two SeqHashes is O(log(n)). In turn, the cryptographic signing of SeqHashes is O(1), as each tree node carries a cryptographic hash.

Of course, having each node carry a hash incurs a hefty overhead, but that can be alleviated by (lazily) grouping nodes into buckets (turning it into some kind of B-tree).

SeqHashes also don't support splitting in O(log(n)) like for example AVL trees.

I've created a B-tree version of SeqHash, called SplitHash, that also allows splitting SeqHashes in O(log(n)).

See: https://github.com/odipar/spread/blob/master/src/main/scala/...

Here's one I don't know if I've ever seen documented anywhere. If anyone knows a proper name for this one, let me know!

Imagine it's for a text editor, and you want to map Line Numbers to Byte Positions. But then you want to insert a byte somewhere, and you need to add 1 to all your Byte Position values.

Instead of actually keeping a big array of Byte Position values, you have a hierarchical array. The convention is that whenever you want to read the value, you add the Parent's value, and that parent's value, etc. Number of parents would be logarithmic.

So if you want to add 1 to everything past a certain point, you just apply it to some leaf nodes, then the next internal nodes, rather than applying it to every single leaf node.

That's good for representing the text, but I'm more focused on the part about being able to add X to everything past a particular point very quickly by manipulating parent values.
Ropes do this too; each node is labeled with the size of its subtree. If you want to compute line numbers as well, add them as another label.
Due to data immutability, a lot of the standard data structures in Haskell are quite interesting.

E.g. `Data.Sequence` (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...) is based on Finger Trees (http://staff.city.ac.uk/~ross/papers/FingerTree.html), which allow O(log(min(i,n-i))) inserts (at i) and O(1) inserts at the beginning or end, even though the data structure is basically copy-on-write.

In general Haskell has a lot of specialized data structures. One of my favorites is `Data.IntMap` (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...), which is a specialized map for integer keys (based on Patricia trees IIRC).

(Man I miss working in Haskell :-( )

Sparse sets.

They're often used in Entity Component System architectures since you have O(1) add/remove/find, & O(n) iteration. Iterations are also packed, which is a bonus for cache efficiency.

Can be difficult to implement well, but the concept is simple and a neat example of a useful specialty data structure. Take a look at https://github.com/skypjack/entt
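
A minimal Python sketch of the add/remove/contains trick (the swap-with-last on removal is what keeps the dense array packed):

    class SparseSet:
        def __init__(self, capacity):
            self.sparse = [0] * capacity  # sparse[value] -> index into dense
            self.dense = []               # packed list of members

        def contains(self, v):
            i = self.sparse[v]
            return i < len(self.dense) and self.dense[i] == v

        def add(self, v):
            if not self.contains(v):
                self.sparse[v] = len(self.dense)
                self.dense.append(v)

        def remove(self, v):
            if self.contains(v):
                i = self.sparse[v]
                last = self.dense.pop()
                if i < len(self.dense):
                    # Move the last element into the hole, keeping dense packed.
                    self.dense[i] = last
                    self.sparse[last] = i

    s = SparseSet(100)
    s.add(42); s.add(7); s.remove(42)
    print(s.contains(7), s.contains(42))  # True False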

You can apply the same idea to hash tables: store hashes and dense-array indices in your sparse array; during lookup, use Robin Hood hashing to cache-efficiently probe the sparse array for a matching hash value, and if a match is found, use the dense-array index to look up the data in the packed/dense array. This approach is both cache efficient and space efficient, as the dense array can grow independently of the sparse array.
1. Probabilistic filtering and matching:

Since you mentioned bloom filters - other probabilistic data structures like count-min sketches (roughly, streaming bloom filters) are super useful. Approximate kmer methods like minhash and w-shingling use them in really cool ways. Rolling hash methods like Rabin chunking also work really nicely with probabilistic/streaming hash tables - splitting the data stream into chunks that can be consistently filtered or matched is useful in many circumstances! Think NSA-level data harvesting, or realtime genome/geospatial data classification.

2. Checking for, locating and counting subsequences in giant string datasets:

Wavelet trees, FM-indices, suffix arrays more generally - and structures that allow extreme performance in very niche situations with arbitrary encodings like bitvectors and compressed integer vectors. If you have a million books, or all the genomes ever sequenced, you use the first structures to find where a specific phrase can be found, even if you don't quite spell it right. You can use the latter ones to do super fast comparisons of giant datasets - given a million viruses, how similar is each one to the human genome, and where specifically do they most closely match?

3. Reconstructing structured information from (potentially noisy) fragments:

De Bruijn graphs (and related graphs like string graphs and colored de Bruijn graphs). This is a way to find all the overlap connections between fragments of information, and then weight the possible traversals of those connections by the evidence. These can be represented using the data structures from #1 (FM-indices for example), and efficiently used in some circumstances with those from #2 to enable some kinds of graph algorithms. If you have a shredded set of documents, or a billion short DNA reads from a genome sequencing experiment, this is how you reconstruct the original.

4. Decentralised coordination structures. Merkle-DAGs and Kademlia DHTs in particular. Being able to compare any trees by root-first hashes, and being able to request the content of the subtree for any node hash from an arbitrary set of peers - these structures enable the p2p web infrastructure. Everything from Limewire to bittorrent, IPFS and blockchains, and, most importantly, Sci-Hub.

1, 2 and 3 together are some of the fundamentals of computational biology. If you're interested in understanding them, https://rosalind.info/problems/list-view/ is a great starting place.

s.gif
Funny you should mention bloom filters and that use case, I just re-read this post again this morning: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

Basically, it makes a similar point to https://news.ycombinator.com/item?id=32186837 — i.e., cache effects are not modeled in big O analysis, even though most problems that involve large amounts of data (so most problems) are memory bound rather than CPU bound. The blog post makes a good case for a very simple hash table as a better fit for this problem.

s.gif
That Cloudflare article is a little frustrating.

> While we could think of more sophisticated data structures like Cuckoo filter, maybe we can be simpler

Yes, standard Bloom filters fail for large filter sizes and/or very small false-positive rates. But we've known this for decades, and tons of other probabilistic filters have come out since then to address the problem. Cuckoo filters in particular are incredible.

Was there really no easy way to bring in an open-source Cuckoo filter implementation? They're not that complicated. Maybe I'm just used to modern languages having good package managers.

Plus the author's solution is basically the idea behind Cuckoo filters: "what if we just use a hash table, but allow for collisions?" Cuckoo hashing is just clever about guaranteeing O(1) worst-case lookup.

And for absolute dead-simple filters, a blocked Bloom filter is basically just as easy as a Bloom filter. It's like two or three more lines of code. It's just changing the indexing operation to be modulo some large "block" size. That said, I don't think they'd work too well in this case, since the author has a pretty small false-positive rate (0.000019), and in my experience small Bloom filters (which is what comprise blocked Bloom filters) don't handle small FP rates well.

But guess what excels at small FP rates… cuckoo filters.

s.gif
A linear probing hash table is simpler. That’s the trade off they were going for at that time. I don’t think that’s the most efficient solution given the hardware they were using, and I don’t think the blog author would either, but it’s certainly easier to write such a hash table — and it’s well written, but still mostly interview level stuff.

To me the blog post is not about cuckoo or bloom filters or hash tables at all. It’s about profiling and highlighting that a naive reading of the literature can easily lead you astray (worse performance and complexity). In school they don’t teach you that mov is the biggest cycle-eater of them all — at least it’s not the lesson people remember.

s.gif
If you’re willing to use moderately more memory, you can get much better cache locality using a blocked bloom filter. The basic idea is to use an array of bloom filters that each fit into a cache line, and use some bits of hash to determine which sub-filter each item goes into. Suddenly each lookup or insertion has at most one cache miss!
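
A rough Python sketch of that layout (parameters and hashing are made up for illustration; a real implementation would align each block to an actual 64-byte cache line and use much cheaper hashes than SHA-256):

    import hashlib

    class BlockedBloomFilter:
        """An array of tiny Bloom filters ("blocks"), each 512 bits, about
        one cache line; the first hash picks the block, the rest set bits
        inside that block only."""

        BLOCK_BITS = 512

        def __init__(self, num_blocks=1024, num_hashes=4):
            self.num_blocks = num_blocks
            self.num_hashes = num_hashes      # keep <= 7 for one SHA-256 digest
            self.blocks = [0] * num_blocks    # each int holds one 512-bit block

        def _hashes(self, item):
            digest = hashlib.sha256(str(item).encode()).digest()
            words = [int.from_bytes(digest[i:i + 4], "big")
                     for i in range(0, 4 * (self.num_hashes + 1), 4)]
            block = words[0] % self.num_blocks
            bits = [w % self.BLOCK_BITS for w in words[1:]]
            return block, bits

        def add(self, item):
            block, bits = self._hashes(item)
            for b in bits:
                self.blocks[block] |= 1 << b

        def __contains__(self, item):
            block, bits = self._hashes(item)
            return all(self.blocks[block] >> b & 1 for b in bits)
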
Dancing links https://en.m.wikipedia.org/wiki/Dancing_Links

“ In computer science, dancing links (DLX) is a technique for adding and deleting a node from a circular doubly linked list. It is particularly useful for efficiently implementing backtracking algorithms, such as Donald Knuth's Algorithm X for the exact cover problem.[1] Algorithm X is a recursive, nondeterministic, depth-first, backtracking algorithm that finds all solutions to the exact cover problem. Some of the better-known exact cover problems include tiling, the n queens problem, and Sudoku.”

Do you know about:

* HyperLogLog? Convenient for doing approximate counting across huge datasets.

* SkipList is another probabilistic data structure that allows you to skip ahead N elements in 1 step. I believe ClickHouse uses it.

* Bitmap index organizes database pointers into a matrix of bits. The bitmap scans simply skip over the zeros. This type of index gets in trouble if you have high cardinality, though.

s.gif
Lots of companies use skiplists without realizing it because Redis uses them as the implementation of sorted sets. For example, last time I checked, every time someone looks at a playlist on PornHub the software is doing a partial traversal of a skiplist behind the scenes.
Not sure you could call it a data structure as such, but Latent Semantic Indexing[1] kinda blew my mind when I learned about it.

The fact that you could perform linear algebra on documents and words, to get out a number that somehow quantified how close a document was to another really surprised me back in the day. Grew a new appreciation for what exactly those vectors in linear algebra could represent.

Not in the field so don't know how obscure it is though.

[1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis#Laten...

Fractional Cascading. A simple and very cool way to speed up searching for the same key in multiple lists. Instead of K binary searches taking O(K log(N)) time, you can do it in O(log(N) + K) time without using asymptotically more space.

https://en.m.wikipedia.org/wiki/Fractional_cascading

I wrote a simple demo in Rust a while back to help myself learn the language.

https://github.com/mgraczyk/fractional_cascading

I also think pairing heaps are neat.

Invertible Bloom Lookup Tables. They're like Bloom filters but you can also remove elements and even reconstruct elements in certain cases. Useful for syncing 2 databases which are almost in-sync, by sending only a small amount of data between them. It's used by Bitcoin nodes to communicate the contents of newly mined blocks.
Reservoir sampling is a statistical technique used to randomly select a finite number of elements from a population. The elements are chosen such that each element has an equal probability of being selected. This technique is often used when it is impractical to select a random sample of elements from a very large population.

To do reservoir sampling, you first need to decide how many items you want in your sample. This number is called the size of the reservoir. Once you have the size of the reservoir, you can fill it by selecting items from the population at random.

The code is beautifully simple:

  reservoir = population[0..k-1]   // pre-fill with the first k items
  for i = k to population.length-1
    j = random integer between 0 and i, inclusive
    if j < k
       reservoir[j] = population[i]
  return reservoir
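
A direct, runnable Python translation of the pseudocode above (using the standard `random` module):

    import random

    def reservoir_sample(population, k):
        """Return k items chosen uniformly at random from `population`,
        which can be any iterable, e.g. a stream too large for memory."""
        reservoir = []
        for i, item in enumerate(population):
            if i < k:
                reservoir.append(item)        # pre-fill with the first k items
            else:
                j = random.randint(0, i)      # inclusive on both ends
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(1_000_000), 5))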
s.gif
Reservoir sampling is really cool - a slightly optimized version (algorithm L) lets you skip over every record that will not be sampled and it is still pretty simple. If your records are fixed size this can be an awesome speedup.

(* S has items to sample, R will contain the result *)

ReservoirSample(S[1..n], R[1..k])

  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]

  (* random() generates a uniform (0,1) random number *)
  W := exp(log(random())/k)

  while i <= n
      i := i + floor(log(random())/log(1-W)) + 1
      if i <= n
          (* replace a random item of the reservoir with item i *)
          R[randomInteger(1,k)] := S[i]  // random index between 1 and k, inclusive
          W := W * exp(log(random())/k)
https://en.m.wikipedia.org/wiki/Reservoir_sampling
Linked Hash/Tree Maps, simple, but elegant. A Map with its nodes connected in a linked list so you can traverse them in insertion order (and O(n) time). Very useful for window queries over sequential data and other cases where you want FIFO access, but also quick access by a field of the data.
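
In Python, `collections.OrderedDict` is essentially this structure (a hash table plus a doubly linked list threaded through the entries). A toy FIFO window cache as an illustration:

    from collections import OrderedDict

    class WindowByKey:
        """Keep at most `maxsize` entries, evicting in insertion (FIFO)
        order, while still allowing O(1) lookup by key."""

        def __init__(self, maxsize):
            self.maxsize = maxsize
            self.data = OrderedDict()

        def add(self, key, value):
            self.data[key] = value
            if len(self.data) > self.maxsize:
                self.data.popitem(last=False)   # evict the oldest insertion

        def get(self, key, default=None):
            return self.data.get(key, default)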
s.gif
You forgot the most important feature over normal hash maps: they offer a deterministic iteration order without incurring the cost of a tree-based ordered map.

(If you don't know why this is important then maybe you haven't worked on large systems that undergo rigorous evaluation)

s.gif
Is (non-)determinism really the right concern here? I’m aware that most hash tables do not have generally predictable iteration orders, but I nevertheless understood them to be deterministic.
s.gif
They are deterministic in the sense that provided everything else is the same, iteration order would be the same after each insertion.

However, it can change after insertion (if the hash bucket count changes). If the hash function is random-keyed, iteration order would be different for different instances of the hash table even if you insert the same items in the same order.

Sometimes you want the order to be consistent every time. Be it insertion order, or some natural order of the items.

s.gif
Well, hard/impossible to predict perhaps. Iteration order can depend on the order things were inserted and deleted and may differ from computer to computer (for example in Julia the hash-based dictionary ordering differs between 32 and 64 bit systems, and might change between versions of Julia, etc - you’d see the same thing with C++ unordered maps, etc, etc).
s.gif
You can do this without a linked list and it can be quite performant, like python’s newest dictionary implementation (appeared around Python 3.6 or 3.7 from memory).
Quadrilateral Simmelian backbones

Count quadrangle structures in graphs to derive an embeddedness coefficient for each edge and expand the graph along the most embedded edges, yielding nice clusters.

https://jgaa.info/accepted/2015/NocajOrtmannBrandes2015.19.2...

Not obscure by itself, but gifted with obscurity due to its language’s success: JavaScript’s Map type.

Just because the type was introduced very late to the language, and another very successful method is named identically, it's almost impossible to google your way out of any situation.

You have to rely on core documentation and link lists. Just by using “new Map()” in JS, you’re suddenly, ahem, mapped 30 years back in time!

s.gif
Nuts! I was going to say splay trees. Completely silly these days, like most trees, given caching hierarchies and bursty memory accesses, but a really neat data structure all the same.
Hazard Pointers are an interesting concurrent data structure.

Suppose we've got a lot of Doodads, let's say there's a Graph of ten million Doodads, and a whole bunch (dozens? hundreds?) of threads are poking around in this same graph, maybe looking at Doodads and sometimes (but not often) removing them from the Graph.

What happens if my thread is looking at a Doodad, and meanwhile a different thread removes it from the Graph? Is the Doodad destroyed while I was looking at it? That sounds very bad. What if while I'm looking at it, the Doodad is destroyed, but, a nearly identical Doodad appears in its place. How would I know? Not acceptable.

We could reference count the Doodads, we'd need atomic reference counts, something like C++ shared_ptr -- but this would work. However now our performance is terrible, because we've got 100 threads increasing and decreasing these reference counts just to look at Doodads and each such change causes a write that must be observed on other threads.

So instead Hazard pointers go like this:

Every thread who wants to look at Doodads asks for a Hazard Pointer, usually these are never deleted once made because it's much easier. A thread sets this Hazard Pointer to point at a Doodad when it wants to look at one, it checks this was successful (if not the Doodad is gone now), then contemplates the Doodad (it can be in this state for an arbitrarily long period of time) and then it clears the hazard pointer once it is done. It can only point one Hazard Pointer at one Doodad (some schemes allow a thread to have say, two or some finite number of Hazard Pointers each)

Somewhere a Linked List of all these hazard pointers is kept. When you want to remove a Doodad, you take it out of the graph, then you check the List of Hazard Pointers. If any of them was pointing at the Doodad you removed, you put the Doodad on a special pile since it can't be deleted yet, otherwise you can delete it immediately because nobody else needs it any more. This is a relatively expensive operation, but you're only doing it for removal, not just reads.

Finally there needs to be some mopping up of that pile of "couldn't delete yet" Doodads by trying to delete them again (checking the Hazard Pointers again of course). This can be periodic, it can happen every time somebody tries to delete one, or whatever, the best policy may vary between applications.

This approach means reading Doodads costs only two local writes, no waiting around for the cache. Removing them is more expensive, but that's OK because it was rare.
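
Purely to illustrate the bookkeeping, here is a Python sketch (Python is garbage-collected and has a GIL, so this is a picture of the protocol, not a usable implementation; it also leaves out the reader's re-check that the Doodad is still in the graph after publishing its hazard pointer):

    import threading

    class HazardPointerDomain:
        def __init__(self):
            self._lock = threading.Lock()
            self._hazards = {}    # thread id -> Doodad it is currently reading
            self._retired = []    # (doodad, destroy) pairs not yet safe to free

        def protect(self, doodad):
            # Reader: publish "I am looking at this" (a cheap local write).
            self._hazards[threading.get_ident()] = doodad

        def clear(self):
            # Reader: done looking.
            self._hazards.pop(threading.get_ident(), None)

        def retire(self, doodad, destroy):
            # Writer: the Doodad is already unlinked from the graph; destroy
            # it only if no thread currently holds a hazard pointer to it,
            # otherwise park it on the "couldn't delete yet" pile.
            with self._lock:
                self._retired.append((doodad, destroy))
                in_use = {id(d) for d in self._hazards.values()}
                still_retired = []
                for d, free in self._retired:
                    if id(d) in in_use:
                        still_retired.append((d, free))   # try again later
                    else:
                        free(d)
                self._retired = still_retired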

s.gif
Have you used this before? What was the domain?
s.gif
I have seen this general pattern in several decently large game object systems, although I can’t think of which ones off hand. If I’m recalling correctly, it tends to be used in longer-lived (aka tens of frames) object status/existence checks, and often in AI subsystems, pathing, and general decision-making queries, etc.
s.gif
Seems like games are an 'easy' case for this kind of thing, since you can synchronise every frame, and arbitrate destruction then; no?
s.gif
These are not commonly used in application code, but are "commonly" used to implement reclamation in languages without memory management for concurrent mutable data structures.
I saw this and bloom filter was my very first thought, so I'm glad you had it.

We used it to check if you had voted on a discussion on reddit. In almost all cases the user had not voted, so it was a lot quicker to check if you hadn't voted than if you had.

Disruptor queues are also fun. They are lock free multi-producer multi-consumer circular buffers. I figured out the basics on my own in a vacuum out of need, then discovered a relevant white paper.

I used one to implement a fast shared memory pub-sub message bus for communicating across dozens of processes in a simulation framework.

s.gif
Do you have a link to the white paper you found?
s.gif
This doesn't look like what I remember, but it sounds right.

https://lmax-exchange.github.io/disruptor/disruptor.html#_de...

The basic idea is that you maintain two atomically incremented counters. You increment one to allocate some buffer space to write your message into, and you increment the other once the message is written in. There's some tricky details to get right when multiple producers have messages staged up at the same time.

s.gif
Yes, R-tree turned out to be a godsend when working with larger geospatial data and calculating intersections.
s.gif
I've used quadtrees for some cool stuff I can't really talk about. All the things you think binary trees are good at but with x,y coordinates, or any n-tuple of coordinates as you suggest.
BW Trees are pretty neat [1][2]. Typically BTrees in database engines need latches to prevent concurrent writes messing up the tree but BW Trees conveniently sidestep that requirement by appending all node updates to a linked list (called a Delta Chain), and having each reader apply the delta of updates by traversing the linked list until it hits the final node in the list. Periodically, these linked lists are compacted back into single nodes.

If you like these kinds of data structures, the Database Internals [3] book has a good introduction and explanation of a handful of BTree optimizations like this.

1: https://www.microsoft.com/en-us/research/wp-content/uploads/...

2: https://www.cs.cmu.edu/~huanche1/publications/open_bwtree.pd...

3: https://www.databass.dev/

s.gif
The fact that suffix trees/suffix arrays can be built in O(n) is IMO the most surprising and "magical" result in computer science. To me, more surprising than median of medians/quick select.

There might be more "magical"/surprising things that are obscure and I am unaware of, but to me, this is it. :)

s.gif
what about randomized O(n) minimum spanning trees?
Not a data structure, but a nice concept for very fast search on an immutable pre-sorted array: Eytzinger order. I did some experiments; it is about 2-3x faster than binary search. https://medium.com/swlh/binary-search-vs-eytzinger-order-301...
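
A small Python sketch of the layout and the lookup (the speedup in the article comes from the cache-friendly access pattern and prefetching, which pure Python won't show, but the logic is the same):

    def eytzinger(sorted_vals):
        """Re-lay a sorted array in BFS (Eytzinger) order, 1-indexed."""
        out = [None] * (len(sorted_vals) + 1)
        it = iter(sorted_vals)

        def fill(k):
            if k < len(out):
                fill(2 * k)          # in-order traversal assigns sorted values
                out[k] = next(it)
                fill(2 * k + 1)

        fill(1)
        return out

    def contains(eyt, x):
        """Walk the implicit tree: children of node k live at 2k and 2k+1."""
        k = 1
        while k < len(eyt):
            if eyt[k] == x:
                return True
            k = 2 * k + (1 if eyt[k] < x else 0)
        return False

    layout = eytzinger(list(range(0, 100, 3)))
    print(contains(layout, 27), contains(layout, 28))   # True False
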
Interval trees. Really cool trees that allow fast lookups of all the ranges that contain a given point.

https://en.wikipedia.org/wiki/Interval_tree

s.gif
Found out about this DS in AoC 2021 Day 20+ problems if I remember correctly. Pretty interesting.
Deterministic acyclic finite state automata. Essentially a compressed trie. Significantly more memory efficient for real world dictionaries.

https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...

Skip lists.

They are useful for checking in O(log(n)) if a number (or string) is in any of n ranges of numbers.

You just keep a sorted list of the endpoints of the ranges, and when you do a lookup, you binary search for whether the search value is in the list and, if not, between which two indexes it falls. If it's between an even and an odd index, it's in.

Very useful if you remember that IP addresses are also integers, and with IPv6 networks become way too huge to enumerate.
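
A sketch of that lookup with Python's `bisect`, assuming closed, non-overlapping, pre-sorted ranges (think blocked IP blocks as integers):

    import bisect

    ranges = [(10, 20), (50, 59), (100, 199)]               # example data
    endpoints = [x for lo, hi in ranges for x in (lo, hi)]  # stays sorted

    def blocked(ip):
        i = bisect.bisect_left(endpoints, ip)
        # Odd index: strictly inside a range. Even index: inside only if it
        # lands exactly on a left endpoint.
        return i % 2 == 1 or (i < len(endpoints) and endpoints[i] == ip)

    print(blocked(15), blocked(10), blocked(60))   # True True False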

s.gif
Never heard of this one until it came up in a google interview. Totally botched it. I couldn't think of any way to do it to their liking that doesn't require a dynamic cast.
I'm thinking that the bar for "obscure" here is maybe pretty low if bloom filters pass?

Here's a few fun ones.

The Burrows-Wheeler transform is the character cyclically preceding each suffix in a (cyclic) suffix array, and there turns out to be a simple algorithm to recover the original text from this transform; live demo: http://canonical.org/~kragen/sw/bwt. But the new text is very easily compressible, which is the basis for bzip2. As mentioned by Blahah and dahart, this is only one of many interesting things you can do with suffix arrays; they can also do efficient regular expression search, for example. This is more interesting since three practical linear-time and -space suffix array construction algorithms were discovered in the early 02000s.
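
A naive O(n² log n) Python demo of the transform and its inverse, just to see it work (real implementations build the suffix array instead of materializing rotations, and the sentinel is assumed not to occur in the input):

    def bwt(text):
        text += "\0"                      # unique sentinel marks the end
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    def inverse_bwt(last_column):
        table = [""] * len(last_column)
        for _ in range(len(last_column)):
            table = sorted(last_column[i] + table[i]
                           for i in range(len(last_column)))
        return next(row for row in table if row.endswith("\0"))[:-1]

    transformed = bwt("banana")
    print(repr(transformed))              # equal characters cluster together
    print(inverse_bwt(transformed))       # banana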

Pentti Kanerva devised a "fully distributed representation" in the 01990s, which has some features in common with bloom filters — every association stored in an FDR record is stored to an equal extent (probabilistically) in every bit of the record, so erasing some of the bits erases none of the associations but increases the error probability for all of them. You assign a random or pseudorandom bitvector (of, say, 16384 bits) to every atomic value to be stored, represent an association of two (or more) atoms as their XOR, and then take the bitwise majority rule of all the associations to be stored as the record. To retrieve an association, you XOR the record with an atom, then compute the correlation of the XOR result with all possible atoms; the correlation with the correct atoms will be many standard deviations away from the mean, unless the number of associations stored in the record is high enough that conventional data structures wouldn't have been able to store it either.

This and some related work has blossomed into a field called "hyperdimensional computing" and "vector symbolic architectures". But the problem of finding the nearest atom or atoms seems, on its face, to make this impractical: if you have 16384 possible atoms and 16384 bits in your bitvector, you would seem to need at least 16777216 bit operations to compute all the associations.

I realized the other night that if you use the rows of a Walsh–Hadamard matrix as the "pseudorandom" atom representations, you can compute all the correlations in linearithmic time with the fast Walsh–Hadamard transform. Moreover, Cohn and Lempel's result from 01977 "on fast M-sequence transforms" is that after a simple transposition you can also use the FWHT to compute all the correlations with the complete output of a maximal LFSR (a so-called "M-sequence") in linearithmic time, so you can also use the cyclic shifts of an LFSR output for your atoms without losing efficiency. Probably someone else has already figured this out previously, but I hadn't found it in the literature.

I'm not sure if the quantities used in reduced affine arithmetic count as a "data structure"? They're a data type, at least, but they're not a container. The idea is that you execute some arbitrary numerical algorithm, not merely on floating-point numbers, or even on dual numbers as in forward-mode automatic differentiation, nor even yet on (min, max) intervals as in ordinary interval arithmetic, but on a vector of coefficients [a₀, a₁, a₂...aₙ], which represents a linear function a₀x₀ + a₁x₁ + a₂x₂ + ... + aₙxₙ, where x₀ = 1 and each of the other xᵢ ∈ [-1, 1], so this function is really a representation of an interval. Then you can assign each of the xᵢ (except x₀ and xₙ) to be the unknown error in one of the input parameters to your algorithm, whose magnitude is determined by the coefficient aᵢ in the value of that parameter; and when you introduce rounding error, you increase aₙ enough to account for it. (The non-reduced kind of affine arithmetic just increases n without limit to account for new roundoff errors.)

Reduced affine arithmetic has two advantages over the usual kind of interval arithmetic. First, it cancels errors correctly; if you calculate (y+1)(y+2)/(y+3) with interval arithmetic, with some uncertainty in the value of y, you usually get an unnecessarily loose bound on the result because interval arithmetic assumes that the uncertainties in the numerator and the denominator are uncorrelated, when in fact for many values of y they partly cancel out. (You can't apply this cancelation to aₙ, the error stemming from all the roundoffs in your algorithm, just the other aᵢ.) But the more interesting result is that RAA gives you a linear function that approximates the calculation result as a linear function of the inputs, which is pretty similar to calculating the gradient — but with error bounds on the approximation!

LSM trees are only about as "obscure" as bloom filters, since they underlie several prominent filesystems, LevelDB, and most of the world's search engines, but if you don't know about bloom filters, you might not know about LSM trees either. I wrote a tutorial explanation of LSM trees in https://github.com/kragen/500lines/tree/master/search-engine, though the name "LSM tree" hadn't yet gone mainstream.

Binary array sets, as described in https://www.nayuki.io/page/binary-array-set, are closely related to LSM trees, to the point that you could describe them as an application of LSM trees. They are a simple data structure for the exact version of the problem bloom filters solve approximately: maintaining a set S of items supporting the operations S.add(item) and S.contains?(item). Insertion (.add) is amortized constant time, and membership testing (.contains?) is O(log² N).
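
A compact Python sketch of a binary array set (using `sorted()` where a real implementation would do a linear merge of the two equal-sized runs):

    import bisect

    class BinaryArraySet:
        """Sorted runs of distinct power-of-two sizes; inserting works like
        incrementing a binary counter, merging equal-sized runs as it carries."""

        def __init__(self):
            self.runs = []    # runs[i] is [] or a sorted list of length 2**i

        def add(self, item):
            carry = [item]
            for i in range(len(self.runs)):
                if not self.runs[i]:
                    self.runs[i] = carry
                    return
                carry = sorted(self.runs[i] + carry)   # merge two 2**i runs
                self.runs[i] = []
            self.runs.append(carry)

        def __contains__(self, item):
            for run in self.runs:
                if run:
                    j = bisect.bisect_left(run, item)
                    if j < len(run) and run[j] == item:
                        return True
            return False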

A different and more efficient data structure for the exact set membership problem, limited to the case where the items to be stored are up to M integers in the range [0, N), supports constant-time creation, clearing, insertion, size measurement, and membership testing, and linear-time enumeration, but I think cannot be implemented in modern standard C because of its rules about reading uninitialized data. (By allowing creation to be linear time you can implement it in standard C.) It uses two arrays, A[M] and B[N], and a variable n, and i is a member of the set iff B[i] < n and A[B[i]] == i, from which it is straightforward to work out the other methods.

I don't remember what this data structure is called, but I definitely didn't invent it. I described it and some uses for it in https://dercuano.github.io/notes/constant-time-sets-for-pixe.... You can associate additional values with this structure to use it as a sparse array.

Oh, I see kcbanner posted this one too, with a link to Russ Cox's writeup, which is better than mine and explains that it was published by Briggs and Torczon in 01993, but probably not originally invented because it was an exercise in an Aho, Hopcroft, and Ullman textbook in 01974 and in Programming Pearls: https://research.swtch.com/sparse

Can I describe a data queueing problem that I feel like there is a specific data (or queue) structure for, but that I don't know the name is?

Let's say you are trying to "synchronize" a secondary data store with a primary data store. Changes in the primary data store are very "bursty", one row will not change for days, then it'll change 300 times in a minute. You are willing to trade a bit of latency (say 10 seconds) to reduce total message throughput. You don't care about capturing every change to the primary, you just want to keep it within 10 seconds.

It feels like there should be a clever way to "debounce" an update when another update overrides it 500ms later. I know debounce from the world of front-end UI where you wait a little when doing autocomplete search based on keyboard input so as not to overwhelm the search.

s.gif
I've seen this done with kinesis streams.

Basically just update a cache, and forward the results every so often.

If you put things in a stack, you can keep using the most recent for the update. Compare times and you won't add a bad value

s.gif
Ideally you have a reliable Change Data Capture (CDC) mechanism like a Binlog Reader. Debezium, for example, can write directly to a queue like Kafka. A Kafka consumer picks up the events and writes to your secondary datastore. Something like that can probably handle all of your events without bucketing them, but if you want to cut down the number of messages written to the queue you can add that logic into the Binlog Reader so it emits a burst every 5 seconds or so. During those 5 seconds it buffers the messages in the processes memory or externally in something like Redis using a key so only the latest message is stored for a given record.
s.gif
If you want everything to be within 10 seconds, then you build a state-change tracker (which only tracks all state changes since last update) and then you send the updates every 10 seconds.

Don't worry about debouncing - the state tracker should handle representing the 300 updates as a single state change, and if there are more then they just get sent in the next update 10 seconds later.
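
One way to sketch that tracker in Python (names are made up for illustration): the CDC reader calls `on_change`, later updates to the same key simply overwrite earlier ones, and a background thread flushes whatever is pending every 10 seconds.

    import threading
    import time

    class DebouncedSync:
        def __init__(self, write_to_secondary, interval=10.0):
            self.write = write_to_secondary   # callable(key, value), assumed
            self.interval = interval
            self.pending = {}                 # key -> latest value seen
            self.lock = threading.Lock()
            threading.Thread(target=self._flush_loop, daemon=True).start()

        def on_change(self, key, value):
            with self.lock:
                self.pending[key] = value     # 300 updates collapse to one

        def _flush_loop(self):
            while True:
                time.sleep(self.interval)
                with self.lock:
                    batch, self.pending = self.pending, {}
                for key, value in batch.items():
                    self.write(key, value)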

The humble array, but with a twist: you index into it the way a hardware multiplexer selects an input, using 'shift left' and 'or' bitwise operations.
    /* Bits: selected, hovered */
    const colors = [
      grey,  // 00
      green, // 01
      blueA, // 10
      blueB  // 11
    ]
    const color = colors[selected << 1 | hovered]

https://blog.uidrafter.com/bitwise-table-lookup
s.gif
No hardware multipliers here: left shifts are handled by much cheaper hardware [1], and are almost part of the basic arithmetic logic unit taught in school -- it can do addition, subtraction, bitwise operations, shifts by one, and maybe shifts by any number less than their word size.

[1] https://en.wikipedia.org/wiki/Barrel_shifter

A sorted list. It's often cheaper to sort the list after each change than to search the whole list, especially when you do collision detection in 3D (maybe not so obscure, but at least underestimated). When I pair my socks I first place them in color order :)
s.gif
Note that re-sorting an array after making a constant number of changes can be done in O(n) and will usually be very fast in practice. Insertion sort will do it, as will Timsort and similar algorithms.
s.gif
I was reading up on C++'s standard library and found out that maps and sets are generally implemented as self-balancing binary search trees (typically red-black trees), which keep their elements in sorted order
Bitboards for chess board representation, the fastest and most complex such representation: https://www.chessprogramming.org/Bitboards Other structures for chess are easier to follow and understand: https://www.chessprogramming.org/Board_Representation
First contact with Functors, multidimensional Kalman filtering abstractions, and Golomb ruler optimized DFT... kind of like meeting the Wizard of Oz, and finding out it was a chicken all along. ;)
>Good use-case: routing. Say you have a list of 1 million IPs that are black listed. A trivial algorithm would be to compare every element of the set with a given IP. The time complexity grows with the number of elements. Not so with a bloom filter! A bloom filter is one of the few data structures whose time complexity does not grow with the number of elements due to the 'keys' not needing to be stored ('search' and 'insert' is based on the number of hash functions.)

The easy and efficient way to test if a value is in a list is to use a hash set or dictionary. Complexity is O(1) on average.

s.gif
OP is probably mentioning bloom filters because they are space efficient. A good filter will use less space than the entire hash set, making it a good candidate to be kept in RAM to filter out requests and reduce expensive calls to read disk.
Bitboards, although perhaps not so obscure, allow for some very nice optimization tricks.

I've used them to solve Sudoku puzzles really fast.

Sketching data structures: count min sketch, hyperloglog, sliding window hyperloglog, t digest

Skiplists

Locality sensitive hashing

Log structured storage (bitcask, lsm)

Monotonic stacks are neat.

I made up a data structure once, consisting of a pyramid of deques. It lets you efficiently compute any associative function over a streaming window of data.

s.gif
You may be interested in the work myself and some coauthors have done on data structures and algorithms for streaming aggregation. GitHub repo for the code, which high level descriptions of their properties and pointers to papers: https://github.com/IBM/sliding-window-aggregators
s.gif
I recently learned about monotonic stacks thanks to this LeetCode problem: https://leetcode.com/problems/sum-of-total-strength-of-wizar...

I feel like they're quite possibly the most deceptively simple data structure I've yet to encounter. That is to say that for me at least there was/is a wide gulf between simply understanding what the data structure is (doesn't get much simpler than a stack!) and when/how to actually apply it.
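
A minimal example of the pattern, using the classic "next greater element" problem rather than that one: the stack holds indices whose answer isn't known yet, with values decreasing from bottom to top, so every index is pushed and popped at most once and the whole thing is O(n).

    def next_greater(nums):
        """result[i] = index of the first element to the right of i that is
        strictly greater than nums[i], or -1 if there is none."""
        result = [-1] * len(nums)
        stack = []                      # indices with pending answers
        for i, x in enumerate(nums):
            while stack and nums[stack[-1]] < x:
                result[stack.pop()] = i
            stack.append(i)
        return result

    print(next_greater([2, 1, 5, 3, 4]))   # [2, 2, -1, 4, -1]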

The Alias method: O(n) space and O(1)-time sampling from a biased (weighted) discrete distribution

https://en.wikipedia.org/wiki/Alias_method

Mind was utterly blown when I first understood the way this works in my late teens.

Colonies/hives, which have fast insertion/erasure/iteration and preserve pointers on insertion/erasure. I remember that Cataclysm: DDA used this data structure for keeping track of the items on each map tile.

https://plflib.org/colony.htm

In a Scrabble AI coding contest I discovered the GADDAG which is like a trie but indexes all of the words forwards and backwards from any starting point within the words.

https://en.wikipedia.org/wiki/GADDAG#:~:text=A%20GADDAG%20is....

I'll just sit here hoping that all of the above datatypes get a nice rust crate :)
> Lets you test if a value is definitely NOT in a list of pre-stored values

wouldn’t a hashmap / dict / object do this in O(1)? Input goes in, is hashed, the key exists or doesn’t exist?

Cuckoo hashing: https://en.wikipedia.org/wiki/Cuckoo_hashing

Hash tables that come with a parameter that lets you control density vs. speed.

Good! I got always fascinated by non-boolean indexing (probabilistic instead).

I remember having implemented an OOP indexed version of U.C. Berkeley's Cheshire-II in 2007 with Dolphin Smalltalk.

I was using BTrees tho, hence O(log N). The non-boolean part was in the values of the keys which were probability of good match against your search target.

FSTs (finite-state transducers) are one underappreciated data structure. They're used in places like speech recognition and synthesis, machine translation, optical character recognition, pattern matching, string processing, machine learning, and information extraction. If you know about them you know exactly where to use them.
s.gif
+1 if your input and output are both sequences, FSTs can be your best friend…Unless you prefer the company of the flaky cool kids from deep learning (CTC, RNN-T, …)
I like consistent hashing (https://en.m.wikipedia.org/wiki/Consistent_hashing). When a hash table (or load balancing pool) needs to be resized, it usually reduces the number of keys (clients) that need to be remapped.
s.gif
Even simpler is Weighted Rendezvous Hashing (https://www.snia.org/sites/default/files/SDC15_presentations...).

It's quite a bit easier to implement and verify than consistent hashing, and carries the same benefits (minimal reshuffling on ring resizes, etc)
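
A tiny Python sketch of the weighted variant (hash choice is arbitrary here): every node gets a score for the key, the highest score owns the key, and removing a node only remaps the keys that node owned.

    import hashlib
    import math

    def _unit_hash(key, node):
        """Hash (key, node) to a float strictly inside (0, 1)."""
        digest = hashlib.sha256(f"{key}|{node}".encode()).digest()
        return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)

    def pick_node(key, weighted_nodes):
        """weighted_nodes: dict of node name -> positive weight."""
        return max(weighted_nodes,
                   key=lambda n: -weighted_nodes[n] / math.log(_unit_hash(key, n)))

    nodes = {"cache-a": 1.0, "cache-b": 1.0, "cache-c": 2.0}
    print(pick_node("user:42", nodes))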

I am trying to compile a list of data structures (and algorithms and other stuff) into a kind of knowledge base: https://github.com/Dobatymo/data-algos/blob/master/data-stru.... There are few cool and obscure data structures already. Please feel free to help extend it!
String tries are pretty fun if you need to implement auto completion. I spent some time implementing my own on one project and open sourced it. And then brought it in on a few more projects.

I've used bloom filters with in memory caches a few times.

Skew heaps. They aren’t useful but their utter simplicity is really fun.
Ole Begemann and Chris Eidhof wrote Advanced Swift, and, in there, described a really efficient FIFO queue.

I implemented a variant of it, in my Generic Swift Toolbox Package[0]. It’s lightning fast.

[0] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...

So it’s not snazzy, but I also wrote Objective-C a long time without knowing about NSPointerArray.

Among other things, it can store nil values, which if you’ve written any Cocoa apps, is an all too common way to crash your software.

Resource Description Framework (RDF)

The Resource Description Framework (RDF) is the W3C standard for web data. It allows data to be linked, anywhere, and to be "grounded" in semantic descriptions of the data. The core RDF data model is very simple: it is a set of triples organized into an RDF graph.

https://www.w3.org/egov/wiki/Resource_Description_Format

I recently learned about deques in Python (other languages may have it too) which is like a linked list but it keeps pointers for the first and last elements. This allows you to insert/read/remove elements from the beginning or end of the list in constant time. I’ve never used it, but I could imagine it being useful in certain situations.

https://docs.python.org/3/library/collections.html

s.gif
A deque may be implemented as a linked list, but it's actually more common to be implemented on top of arrays.

Python in particular uses arrays under the hood for almost everything because modern computer hardware is so blazingly fast manipulating them.

s.gif
Arrays aren't efficient when you want to add/remove to the head. Deque in python exists so there is a data structure with constant time pop/push to the head. And it is in fact implemented as a doubly linked list, with various optimizations: https://github.com/python/cpython/blob/v3.8.1/Modules/_colle....
s.gif
It is not commonly done, but there is no reason why you can't have an array with amortised constant-time insertion at both the head and the tail.
s.gif
Neat, it's always best to look at the source. To be clear, a deque can be implemented on top of an array and still have constant time operations on the head.
s.gif
If it's not constant time on both ends, it's not a deque. You can already insert items at any index in an array with linear cost.
Dijkstra maps. This page describes it better than I can: http://www.roguebasin.com/index.php/Dijkstra_Maps_Visualized but it can be a really simple way to add emergent behavior to an AI in a game (my interest in it), among other things.
Ropes. Something between array and tree.
I made https://github.com/mamcx/tree-flat as a flattened tree stored in pre-order that allows very fast iteration, even for child/parent queries. It's based on APL, so not that novel.

I also really like the relational model; it's not that well represented, so I'm making a language on top of it: https://tablam.org.

s.gif
Yeah, that one is good, but I bet it is possible to encode trees efficiently as relations without indirections.

That is part of my look at `tree-flat` and I wanna base trees on that for my lang.

Not crazy obscure, but half-edge mesh data structures are super cool to me.
Nested Set Model: a way to model nested sets of entities and query them in a SQL database. I haven't used it for anything but it seems fun.

https://en.m.wikipedia.org/wiki/Nested_set_model

Piece tables: a persistent data structure used to represent the current document with relatively efficient editing operations in various text editors, word processors, etc.

https://en.wikipedia.org/wiki/Piece_table

The entire space of probabilistic and sublinear data structures is fascinating. From hyperloglog for high cardinality uniqueness counting, to Cuckoo filters (similar to bloom filters), or Count–min sketch for frequency counting.

https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...

I’ve bookmarked this thread. I have a new reading list! Thanks OP!
Arrays are pretty underground, they're like lists but better. C++ offers arrays but unfortunately Python doesn't :(
s.gif
It does: what Python calls lists are actually (dynamic) arrays.

There is also this module, which provides unboxed arrays: https://docs.python.org/3/library/array.html

I always thought the relational calculus was awesome. There’s only one place I’ve seen it in a real world use case, outside of college, and that’s in a database called Suneido.
Actually came here to say bloom filter but you beat me to it.
Static arrays - when was the last time you used a static array?
s.gif
What do you mean by “static” exactly? It could be referring to an array that doesn’t change size. Or the contents are read-only. Or do you mean C/C++ static linking? Or something else?

In C, C++ and CUDA I use constant fixed-size arrays with lookup tables of specialized data all the time. I also use small arrays to hold temporary results that need to get first collected, and then sorted or processed together. Seems like very common in C++, especially embedded, game, and GPU programming. I’d love to hear about why static arrays seem uncommon in your world.

s.gif
Static arrays as in constant fixed-size arrays. With a static array, O(N) is faster than O(log N), and in some cases even O(1) (ahem, hashes) when N is small. Contiguous, cache-friendly memory is king, and even with modern dynamic languages the generational GC can be completely bypassed or highly simplified if the JIT learns that memory access is just dependent on offsets.

Simple code runs fast.

See Pike on Complexity - Rule 3: https://www.lysator.liu.se/c/pikestyle.html

s.gif
I find I use fixed-sized arrays a lot on microcontrollers. You want to avoid the heap (for reliability) and you don't have a lot of RAM, so you must carefully size your fixed-size arrays.
Fountain Codes: reconstruct data from a random selection of packets.

Was deeply patent-encumbered until recently.

https://en.m.wikipedia.org/wiki/Fountain_code

s.gif
I searched this thread far and wide for fountain codes. Surprised they're so low on the list. I find them amazing. Also, the name is really good. The fact that you can collect any random packets in any order invokes the idea of putting a cup under a fountain and letting it fill up with water. Love it.
The Trie is pretty cool. I think it's a bit obscure because the memory use can grow quite fast.
s.gif
That depends very much on how the trie is represented in memory. Something like a burst trie or HAT trie is usually very compact!