5.1 Using a 64 Bit Hash Function

To fulfill the requirement of being able to estimate multiset cardinalities beyond 1 billion, we use a 64-bit hash function.

It would be possible to introduce a similar correction if the cardinality approaches 2^64, but it seems unlikely that such cardinalities are encountered in practice.
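A sketch of how HLL-style implementations typically consume a 64-bit hash: with p index bits (p = 14 here, matching the 16384 registers discussed below), p bits select a register and the remaining 50 bits supply the "rank", the position of the first 1-bit. The function names are illustrative, and SHA-256 truncated to 64 bits stands in for a dedicated 64-bit hash:

```python
import hashlib

P = 14  # index bits: 2**14 = 16384 registers

def index_and_rank(item: str, p: int = P) -> tuple[int, int]:
    # 64-bit hash; SHA-256 truncated to 8 bytes is a stand-in for a
    # dedicated 64-bit hash function (illustrative choice, not Redis's).
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    idx = h & ((1 << p) - 1)   # low p bits select a register
    rest = h >> p              # remaining 64 - p bits
    # rank = 1-based position of the first 1-bit in the 64-p bit field
    # (equals 64 - p + 1 when rest is all zeros)
    rank = (64 - p) - rest.bit_length() + 1
    return idx, rank
```

Each register then keeps the maximum rank seen for its index, which is why 6 bits per register (values up to 63) are enough.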

5.2 Estimating Small Cardinalities
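For small cardinalities the raw HLL estimate is biased, so implementations fall back to LinearCounting: when V registers are still zero, the estimate m * ln(m / V) is more accurate. A minimal sketch of that correction (the function name is illustrative):

```python
import math

def linear_counting(m: int, zero_registers: int) -> float:
    """LinearCounting estimate used for small cardinalities:
    m * ln(m / V), where V is the number of registers still at zero
    out of m total registers."""
    return m * math.log(m / zero_registers)
```

With all m registers still zero the estimate is 0; as registers fill up, the estimate grows until the algorithm switches back to the standard HLL formula.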

5.3 Sparse Representation

Example: the actual answer is within 5 ± 1 with probability ≥ 0.9.

How Redis's HyperLogLog Implementation Saves Space

HyperLogLog (HLL) is a data structure and algorithm that solves the Count-distinct problem statistically: it does not guarantee an exact answer, only an approximate count.

The magic of this algorithm is that you no longer need an amount of memory proportional to the number of items counted; instead you can use a constant amount of memory: 12 kB in the worst case, or a lot less if your HyperLogLog (we'll just call it HLL from now on) has seen very few elements.

Redis uses 16384 6-bit registers for a high level of accuracy, for a total of 12 kB per key.
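The arithmetic behind the 12 kB figure: 16384 registers × 6 bits = 98304 bits = 12288 bytes. A minimal sketch of packing 6-bit registers into a byte array (hypothetical helper names, not Redis's actual code):

```python
REGISTERS = 16384
BUF_BYTES = REGISTERS * 6 // 8  # 12288 bytes = 12 kB

def get_register(buf: bytearray, i: int) -> int:
    """Read the i-th 6-bit register from the packed byte array."""
    bit = i * 6
    byte, off = divmod(bit, 8)
    word = buf[byte]
    if byte + 1 < len(buf):       # a register may straddle two bytes
        word |= buf[byte + 1] << 8
    return (word >> off) & 0x3F

def set_register(buf: bytearray, i: int, value: int) -> None:
    """Write a 6-bit value (0..63) into the i-th register."""
    bit = i * 6
    byte, off = divmod(bit, 8)
    word = buf[byte]
    if byte + 1 < len(buf):
        word |= buf[byte + 1] << 8
    mask = 0x3F << off
    word = (word & ~mask) | ((value & 0x3F) << off)
    buf[byte] = word & 0xFF
    if byte + 1 < len(buf):
        buf[byte + 1] = (word >> 8) & 0xFF
```

Six bits per register is enough because, with a 64-bit hash, the rank stored in a register never exceeds 63.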

In the example the sparse representation used just 7 bytes instead of 12 kB to represent the HLL registers. In general, for low cardinalities there is a big win in space efficiency, traded for CPU time, since the sparse representation is slower to access.
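One way to picture the trade-off: a sparse HLL can be kept as a small map of (register index, value) pairs and promoted to the dense 16384-register array once it grows past a threshold. A hypothetical sketch (Redis's actual sparse encoding is a more compact run-length format):

```python
def to_dense(sparse: dict[int, int], m: int = 16384) -> list[int]:
    """Promote a sparse {register_index: max_rank} map to the dense
    m-register array. Registers never seen stay at zero."""
    dense = [0] * m
    for idx, val in sparse.items():
        dense[idx] = max(dense[idx], val)
    return dense
```

A sparse HLL with a handful of entries costs a few bytes; only when enough distinct registers have been touched does paying the full 12 kB for the dense form become worthwhile.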

HyperLogLog vs. Bloom Filter

HyperLogLog (HLL) is a data structure for the Count-distinct problem that produces a low-error estimate using very little memory (roughly 2% error on billion-scale data using 1.5 kB of space). Presto uses it to implement APPROX_DISTINCT() and related functions; since the structure fits easily in the L2/L3 cache, computation speeds up dramatically.

Depending upon the problem at hand, we can achieve speed improvements of anywhere from 7x to 1,000x.

A Bloom Filter also targets large-scale data, but its role is different: it answers whether an element exists, at a cost of roughly 10 bits per key for a 1% false-positive rate. As mentioned earlier, there are variants that provide extra features, such as the Quotient Filter and the Cuckoo Filter, which, unlike a Bloom Filter, supports deletion.
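For concreteness, a minimal Bloom filter sketch (illustrative, not any particular library's implementation): k positions per key are derived by double hashing, and with about 10 bits per key and k ≈ 7 the false-positive rate is roughly the 1% quoted above.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions per key in an m-bit array.
    Supports add and membership test; no deletion (that's what the
    Cuckoo Filter variant adds)."""

    def __init__(self, m_bits: int, k: int = 7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k positions from two 64-bit hash halves.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        # False positives possible; false negatives are not.
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(key))
```

A "no" answer is always correct; a "yes" answer is wrong with probability governed by the bits-per-key budget.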

How Reddit Counts Page Views

Reddit explained how they handle pageviews in "View Counting at Reddit".

The amount of memory varies per implementation, but in the case of this implementation, we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!

The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical error rate of 2%, using 1.5 kB of memory.
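That figure is consistent with HyperLogLog's theoretical relative standard error of 1.04/sqrt(m): 1.5 kB at 6 bits per register is m = 2048 registers, giving about 2.3%. A quick check:

```python
import math

def hll_std_error(m: int) -> float:
    """Theoretical relative standard error of HyperLogLog
    with m registers: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(m)

# 1.5 kB = 12288 bits / 6 bits per register = 2048 registers
# -> 1.04 / sqrt(2048) ~= 2.3%, matching the ~2% figure above.
# Redis's 16384 registers give 1.04 / 128 ~= 0.81%.
```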

Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster.