Redis 對 HyperLogLog 省空間的實作

HyperLogLog (HLL) 是用統計方式解決 Count-distinct problem 的資料結構以及演算法，不要求完全正確，而是大概的數量。

演算法其實沒有很難懂，在 2007 年的原始論文「HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm」裡面可以讀到演算法是長這樣：

可以看到一開始要決定好 b 的值 (於是就會有 2b 個 register)，以及單個 register M[j] 的大小，所以是一開始就會決定好固定大小，無論有多少元素都會先吃掉這麼多空間。

但在 Redis 的文件「HyperLogLog」裡面則是提到很少元素的時候會低於 12KB：

The magic of this algorithm is that you no longer need to use an amount of memory proportional to the number of items counted, and instead can use a constant amount of memory; 12k bytes in the worst case, or a lot less if your HyperLogLog (We'll just call them HLL from now) has seen very few elements.

網路上搜了一下沒看到怎麼做到的，不過直接翻 Redis 的程式碼 hyperloglog.c 可以看到答案。

在檔案開頭的註解可以看到有 16384 個 register (對應到論文裡面的 b = 14，因為 214 = 16384)，單個 register 的大小則是 6 bit (對應到論文裡面的 M[j])，相乘後是 12K bytes，剛好符合文件上的說明：

The use of 16384 6-bit registers for a great level of accuracy, using a total of 12k per key.

在「Dense representation」這邊也說明了每個 register 都是 6 bit 的存放方式，到這邊都與 HLL 論文提到的實作一樣。

省空間的方式是在「Sparse representation」這邊做到的，在大多數的 register 都沒有被設定的情況下，用這種方式可以省下大量的空間，而缺點是當元素「有點多」的時候會有比較高的 CPU time：

In the example the sparse representation used just 7 bytes instead of 12k in order to represent the HLL registers. In general for low cardinality there is a big win in terms of space efficiency, traded with CPU time since the sparse representation is slower to access.

依照註解上面的數字，看起來在 10000 個元素以下有機會低於 12KB，然後夠大的時候從 sparse 轉到 dense 上。

本來以為是什麼其他論文可以調整 b 參數 (enlarge)，結果是個比較像是 hack 的方式搞定，但的確是蠻有效的...

Google 的 HyperLogLog++

算是接續昨天寫的「Redis 對 HyperLogLog 省空間的實作」，在 Redis 的 HyperLogLog 實作有提到 Google 在 2013 年的論文「HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm」，裡面提出了 HyperLogLog++ (HLL++)。論文中 Google 提出來的改進主要有三個，第一個是用了 64-bit hash function： 5.1 Using a 64 Bit Hash Function 原因也有提到，當需要處理超過十億筆資料時，32-bit hash 的 4B 上限就有點不太夠用： To fulfill the requirement of…

March 21, 2024

In "Computer"

Reddit 在處理 Page View 的方式

Reddit 說明了他們如何處理 pageview：「View Counting at Reddit」。以 Reddit 的規模有提到兩個重點，第一個在善用 Redis 的 HyperLogLog 這個資料結構，當量大的時候其實可以允許有微小的誤差： The amount of memory varies per implementation, but in the case of this implementation, we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage! 維基百科上有說明當資料量在…

May 28, 2017

In "Cassandra"

HyperLogLog 與 Bloom Filter

看到 Facebook 在 Presto 裡增加使用 HyperLogLog 計算數量的能力，突然想到常常忘記這兩個拿準確度換速度的資料結構：「HyperLogLog in Presto: A significantly faster way to handle cardinality estimation」。 HyperLogLog (HLL) 是解決 Count-distinct problem 的資料結構，用很少的記憶體就可以給出一個誤差不大的值 (用 1.5KB 的空間處理十億等級的資料，誤差大約在 2%)，所以 Presto 利用這個資料結構引進了 APPROX_DISTINCT() 以及其他的函數，就很容易在 L2/L3 cache 裡運算，藉此大幅提昇速度。 Depending upon the problem at hand, we can achieve speed improvements of anywhere from 7x to…

December 14, 2018

In "Computer"

Author Gea-Suan LinPosted on March 20, 2024March 20, 2024Categories Computer, Murmuring, SoftwareTags algorithm, count, data, distinct, hyperloglog, problem, redis, structure

Mentions

Your email address will not be published. Required fields are marked *

Comment *

Name *

Email *

Website

Notify me of follow-up comments by email.

Notify me of new posts by email.

To respond on your own website, enter the URL of your response which should contain a link to this post's permalink URL. Your response will then appear (possibly after moderation) on this page. Want to update or remove your response? Update or delete your post and re-enter your post's URL again. (Learn More)

Redis 對 HyperLogLog 省空間的實作

Redis 對 HyperLogLog 省空間的實作

Related

Google 的 HyperLogLog++

Reddit 在處理 Page View 的方式

HyperLogLog 與 Bloom Filter

Mentions

Leave a Reply

Post navigation

Recommend

华为2023年营收7042亿元，轮值董事长称整体符合预期

Submit directly from VS Code during contests tool

《HelloGitHub》第 96 期

SEC scores big win in lawsuit against crypto exchange Coinbase

Instagram is working on new Reels feed that combines two users' interests

Talking with Tal Danino, Author of "Beautiful Bacteria"

未来家居周刊| Pinterest 2024流行趋势：全铝家具将成主流？

jank development update - Syntax quoting!

消息队列的七种经典应用场景

可能是迄今为止最好用的WPF加载动画功能(没有之一)

About Joyk