An Interesting Interview Question

On Hacker News Daily I came across "Performance comparison: counting words in Python, Go, C++, C, AWK, Forth, and Rust", an interesting interview question; the discussion on Hacker News is also worth a look: "Performance comparison: counting words in Python, Go, C++, C, Awk, Forth, Rust (benhoyt.com)".

The problem goes like this:

Write a program to count the frequencies of unique words from standard input, then print them out with their frequencies, ordered most frequent first. For example, given this input:

The foo the foo the
defenestration the

The program should print the following:

the 4
foo 2
defenestration 1
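
A minimal, unoptimized Python sketch of the obvious solution (the original article benchmarks many optimized variants; this is just the baseline idea, reading from standard input and normalizing to lowercase):

```python
import sys
from collections import Counter

def count_words(stream):
    """Count case-insensitive word frequencies from a text stream."""
    counts = Counter()
    for line in stream:
        counts.update(line.lower().split())
    return counts

if __name__ == "__main__":
    counts = count_words(sys.stdin)
    # Print words ordered most frequent first.
    for word, freq in counts.most_common():
        print(word, freq)
```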

In an interview the point is not which programming language you use; it is that the interviewer can keep drilling down from there, into profiling, into the internal data structures, and go quite deep.

Interviews aside, this is actually a classic problem that the two masters Donald Knuth and Doug McIlroy both played with back in the day:

Incidentally, this problem set the scene for a wizard duel between two computer scientists several decades ago. In 1986, Jon Bentley asked Donald Knuth to show off “literate programming” with a solution to this problem, and he came up with an exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the inventor of Unix pipelines) replied with a one-liner Unix shell version using tr, sort, and uniq.

The problem they worked on back then was a slight variation, though:

Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.
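
In modern Python the variant is almost a one-liner thanks to Counter.most_common(k); a quick sketch (not what either of them actually wrote):

```python
from collections import Counter

def top_k_words(path, k):
    # Read the file, normalize case, and return the k most common words
    # together with their counts, in decreasing frequency.
    with open(path, encoding="utf-8") as f:
        return Counter(f.read().lower().split()).most_common(k)
```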

One of them was demonstrating literate programming and the other was showing off the power of pipes; both were just using the problem as a vehicle, and neither has much to do with the interview discussion above.

Quotient filter

I have mentioned the Cuckoo filter before in "Cuckoo Filter: a Bloom Filter with Delete"; recently, in "A general purpose counting filter: making every bit count", I came across the Quotient filter, another data structure similar to a Bloom filter but aiming to solve more problems.

A standard Bloom filter (BF) has these problems:

  • The inability to delete items
  • Poor scaling out of RAM
  • The inability to resize dynamically
  • The inability to count the number of occurrences of each item, especially with skewed input distributions.

The Quotient filter (QF) described in the article is designed to solve exactly these problems. The article also introduces the Rank-and-Select-based Quotient filter (RSQF) and the Counting Quotient filter (CQF). They need a bit more space, but they appear to solve quite a few of the issues... (especially the ability to delete).
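
The core quotienting trick is easy to show in a toy sketch: each item's fingerprint is split into a quotient, which selects a slot, and a remainder, which is stored in that slot, which is why membership tests, deletion, and counting all become possible. The class below only illustrates that split using Python lists; the real data structure in the paper packs remainders into a flat table with three metadata bits per slot:

```python
import hashlib

class SimplifiedQuotientFilter:
    """Toy illustration of quotienting; not the paper's actual layout."""

    def __init__(self, q_bits=16, r_bits=8):
        self.q_bits = q_bits
        self.r_bits = r_bits
        # One bucket per possible quotient; each bucket holds remainders.
        self.slots = [[] for _ in range(1 << q_bits)]

    def _fingerprint(self, item):
        # Hash the item, keep q_bits + r_bits of it, then split it.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        f = h & ((1 << (self.q_bits + self.r_bits)) - 1)
        return f >> self.r_bits, f & ((1 << self.r_bits) - 1)

    def add(self, item):
        q, r = self._fingerprint(item)
        self.slots[q].append(r)

    def contains(self, item):
        # May return false positives, never false negatives.
        q, r = self._fingerprint(item)
        return r in self.slots[q]

    def remove(self, item):
        # Deletion works because the remainder itself is stored.
        q, r = self._fingerprint(item)
        if r in self.slots[q]:
            self.slots[q].remove(r)
```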

Performance also looks decent, especially read speed... though I do not know how it compares to the Cuckoo filter.

How Reddit Handles Page Views

Reddit explained how they count pageviews: "View Counting at Reddit".

At Reddit's scale there are two key points. The first is making good use of Redis's HyperLogLog data structure; when the volume is that large, a tiny amount of error is perfectly acceptable:

The amount of memory varies per implementation, but in the case of this implementation, we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!

Wikipedia notes that at cardinalities on the order of 10^9, the memory cost is only 1.5 KB for a typical error of 2%:

The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical error rate of 2%, using 1.5 kB of memory.
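
Using HyperLogLog from Redis comes down to two commands, PFADD and PFCOUNT; a minimal redis-py sketch (the key names here are made up for illustration, not Reddit's actual schema):

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def record_view(post_id, viewer_id):
    # PFADD inserts the viewer into the post's HyperLogLog,
    # which stays at a few kilobytes regardless of cardinality.
    r.pfadd(f"views:hll:{post_id}", viewer_id)

def unique_views(post_id):
    # PFCOUNT returns the approximate number of distinct viewers.
    return r.pfcount(f"views:hll:{post_id}")
```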

The second point is that writes are allowed to lag briefly (pageviews are not reflected in real time), so they batch the writes to reduce load on the Cassandra cluster:

Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster.
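
The batching idea itself is simple: accumulate counts in memory and flush one aggregated write per post every interval. A rough sketch (purely illustrative, with a made-up flush_fn standing in for the Cassandra write; this is not Reddit's actual pipeline):

```python
import threading
import time
from collections import defaultdict

class BatchedCounter:
    """Accumulate per-post counts in memory and flush them periodically, so the
    backing store sees one write per post per interval instead of one per view."""

    def __init__(self, flush_fn, interval=10.0):
        self.flush_fn = flush_fn      # callback that performs the actual write
        self.interval = interval
        self.pending = defaultdict(int)
        self.lock = threading.Lock()
        threading.Thread(target=self._loop, daemon=True).start()

    def increment(self, post_id, n=1):
        with self.lock:
            self.pending[post_id] += n

    def _loop(self):
        while True:
            time.sleep(self.interval)
            with self.lock:
                # Swap out the accumulated batch atomically.
                batch, self.pending = self.pending, defaultdict(int)
            for post_id, count in batch.items():
                self.flush_fn(post_id, count)  # one batched write per post
```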

Note that Redis is used here as the cache layer, not the storage layer.

The main reason is probably that Redis positions itself as a data structure server rather than data structure storage (you can see this from its approach to durability), while Cassandra scales key-value storage very easily but is slow on reads. The two complement each other nicely.