
John Fremlin's blog: Hashtable addition time (draft)

source link: http://john.freml.in/hashtable-time

Posted 2012-07-19 04:57:00 GMT

I was interviewing a candidate and, given their background, was surprised to find that they didn't know that hashtables have non-constant time insertions. A hashtable with constant time insertion and constant time lookup by key is a sleight of hand: an unsustainable, impractical proposition, like a fixed-space representation of numbers. Eventually it runs out.

Now hashtables can have amortized O(1) element access given fixed-size elements, but anybody who has tried to build up a big index knows that inserting n elements costs much more than O(n) time.

The reason is that occasionally the table has to be rebalanced, and this rebalancing takes time.

To maintain the O(1) element access time, the buckets of the hashtable must be bounded in size by some constant k, so when a bucket exceeds this size the table must be rebalanced. Even if we are very generous with the number of buckets, once a large number of items have been inserted some bucket is likely to overflow (the birthday problem again), and at that point the table must be rebalanced.
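A toy simulation (mine, not part of the original post) makes the birthday effect concrete: throw keys uniformly at random into a generous but fixed number of buckets and count how many insertions fit before some bucket would exceed a small bound k, i.e. before a rebalance is forced.

#include <cstdio>
#include <random>
#include <vector>

// Illustrative birthday-problem simulation (added here, not from the post):
// insert uniformly random keys into a fixed set of buckets and stop at the
// first insertion that would push a bucket past the bound k.
int main() {
    const std::size_t buckets = 1 << 20;  // be generous with the bucket count
    const std::size_t k = 4;              // maximum tolerated bucket size
    std::vector<unsigned> load(buckets, 0);
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, buckets - 1);

    std::size_t inserted = 0;
    for (;;) {
        std::size_t b = pick(rng);
        if (load[b] == k) break;          // this insertion would overflow bucket b
        ++load[b];
        ++inserted;
    }
    std::printf("first overflow after %zu insertions into %zu buckets (k = %zu)\n",
                inserted, buckets, k);
    return 0;
}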

Assuming an entirely new hashtable has to be created on rebalancing, the cost of one rebalancing is O(n). Clearly the number of rebalancings must increase as n increases, though as far as I know there is no pretty formula for it. The time complexity is therefore O(n) for the actual insertions plus O(r(n)) for the rebalancings, where r(n) is an increasing function of n, so inserting n elements is definitely more than O(n). For k = 1, by the birthday problem I think this comes to O(n^(3/2)), but for larger k I am uncertain.
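One way to see the rebalancing cost without guessing at a formula is to instrument a real table. This sketch (mine, assuming a standard std::unordered_map that exposes its bucket count via bucket_count()) simply counts how often the table is rebuilt as keys are inserted; each rebuild has to revisit every element already stored, which is the extra work on top of the n insertions themselves.

#include <cstdio>
#include <unordered_map>

// Sketch added for illustration: watch bucket_count() to detect each rehash
// of a std::unordered_map while keys are inserted.
int main() {
    std::unordered_map<long, long> m;
    std::size_t buckets = m.bucket_count();
    std::size_t rehashes = 0;
    for (long i = 0; 100000000 > i; ++i) {
        m[i] = i;
        if (m.bucket_count() != buckets) {  // the whole table was rebuilt
            buckets = m.bucket_count();
            ++rehashes;
            std::printf("rehash %zu after %ld insertions, %zu buckets now\n",
                        rehashes, i + 1, buckets);
        }
    }
    return 0;
}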

For a concrete example, I wrote the following program, which uses a decent hashtable implementation and times the insertion of batches of 10M new elements.

#include <unordered_map>
#include <iostream>
#include <boost/timer/timer.hpp>

int main() {
    std::unordered_map<double, int> m;
    int i = 0;

    for (;;) {
        boost::timer::auto_cpu_timer _;  // prints wall/CPU time for the batch when it goes out of scope
        for (int j = 0; 10000000 > j; ++j) {
            m[i] = i;  // insert a fresh key each iteration (split from m[i] = i++ to avoid unsequenced modification)
            ++i;
        }
    }
}

which demonstrates decidedly non-linear running time for insertions:

 8.633190s wall, 8.330000s user + 0.290000s system = 8.620000s CPU (99.8%)
 13.141784s wall, 12.780000s user + 0.330000s system = 13.110000s CPU (99.8%)
 7.938090s wall, 7.740000s user + 0.190000s system = 7.930000s CPU (99.9%)
 21.005692s wall, 20.410000s user + 0.560000s system = 20.970000s CPU (99.8%)
 8.781604s wall, 8.520000s user + 0.240000s system = 8.760000s CPU (99.8%)
 8.410468s wall, 8.240000s user + 0.150000s system = 8.390000s CPU (99.8%)

If you have a reasonable bound on the size of the hashtable (so that if you reach that bound you promise to delete before you insert), it probably does make sense to model insertions as O(1), I suppose.

Posted 2012-07-20 23:04:15 GMT by James Cranch from 87.112.83.126

Right: but in practice you will often scale up the size of the machine to match the size of the desired hash-table. So suppose you know you need to store a billion 64-bit entries (8GB of keys alone): you might scrape by with a 64GB machine. But then there is almost certainly going to be some rebalancing as you build up this index...

Posted 2012-07-23 07:19:29 GMT by John Fremlin
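Following up on the bounding observation in the comments: when the final size is known in advance, most of the rebalancing in the program above can be avoided by pre-sizing the table, so each batch should take roughly the same time. A minimal variation (mine, not from the post, assuming C++11 std::unordered_map::reserve and a hypothetical upper bound of 100M elements):

#include <unordered_map>
#include <boost/timer/timer.hpp>

// Variation on the program above, added for illustration: reserve buckets for
// all the keys up front so the table should not need to rehash while the
// batches of 10M insertions run.
int main() {
    const long total = 100000000;        // assumed upper bound on the final size
    std::unordered_map<double, int> m;
    m.reserve(total);                    // allocate buckets for 'total' elements once
    int i = 0;
    for (int batch = 0; 10 > batch; ++batch) {
        boost::timer::auto_cpu_timer _;  // times one batch of 10M insertions
        for (int j = 0; 10000000 > j; ++j) {
            m[i] = i;
            ++i;
        }
    }
    return 0;
}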
