structure – Page 2 – Gea-Suan Lin's BLOG

Bloom indexes in PostgreSQL

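Just a few days ago I was telling someone that PostgreSQL is considerably richer and more complete in features than MySQL, and then I came across Percona's "Bloom Indexes in PostgreSQL", which points out that PostgreSQL can use a Bloom filter as an index.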

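I looked it up and this has been supported since PostgreSQL 9.6 (see "PostgreSQL: Documentation: 9.6: bloom"). The documentation doesn't say how DELETE (or UPDATE) is handled, though, and the classic Bloom filter data structure has no way to handle deletions...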

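The better-known structure of the past few years is probably the Cuckoo filter, which not only supports deletion but is also reported to beat the Bloom filter on both space and performance (at least at low false positive rates), so I'm not sure why a Bloom filter was what got implemented...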

The B-tree structure in PostgreSQL

I came across "Indexes in PostgreSQL — 4 (Btree)", which explains PostgreSQL's B-tree structure and how common queries make use of it.

It covers three kinds of queries: search by equality, search by inequality, and search by range. After that it goes into how sorting uses the index.
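To make the three query types concrete, here is a minimal sketch that uses a sorted Python list as a stand-in for the ordered leaf level of a B-tree. It is not how PostgreSQL implements anything; the function names and sample keys are made up for illustration. The point is only that all three searches reduce to locating a boundary in sorted order and then reading neighbouring entries.

```python
from bisect import bisect_left, bisect_right

# Sorted keys stand in for the ordered leaf entries of a B-tree index.
# (Hypothetical sample data; duplicates are allowed, as in a non-unique index.)
keys = [3, 7, 7, 12, 19, 25, 31, 40]


def search_equal(keys, value):
    """Search by equality: positions of every entry equal to value."""
    return list(range(bisect_left(keys, value), bisect_right(keys, value)))


def search_less_equal(keys, value):
    """Search by inequality (key <= value): find the boundary, read up to it."""
    return keys[:bisect_right(keys, value)]


def search_range(keys, low, high):
    """Search by range (low <= key <= high): find both boundaries."""
    return keys[bisect_left(keys, low):bisect_right(keys, high)]


print(search_equal(keys, 7))
print(search_less_equal(keys, 12))
print(search_range(keys, 10, 30))
```

Because the entries are already in key order, the same structure can also return rows sorted without a separate sort step, which is the part about sorting and indexes.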

Although the analysis is about PostgreSQL, the concepts are general, and other databases built on B-trees work in much the same way...

The pain points of YAML

On Changelog I saw "In defense of YAML", an entry about YAML's problems that links to the article "In Defense of YAML".

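I don't necessarily agree with everything in the article, but the two points it raises really are pain points: the first is whitespace (or rather, indentation), and the second is the special syntax. Both are real headaches when working with YAML: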

Whitespace is a minefield. Its syntax is surprisingly complex.

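It's like == in JavaScript (I mean my earlier post "JavaScript 的 == 條列式比較", an itemized comparison of JavaScript's ==): you can memorize the definition, but it never feels like it makes much sense, and you're left with a feeling of resignation...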

The article also mentions that JSON was designed without comments, which is indeed one of the more frustrating things about using it...

Mark Callaghan introduces LSM trees in five minutes

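Mark Callaghan, who built MyRocks, gave a five-minute introduction to LSM trees at CIDR 2019: "Geek code for LSM trees". I checked, and CIDR is held every two years, which is different from the other conferences I've come across...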

The slides are at "Diversity of LSM tree shapes", and they show the key points he chose given that everything had to fit in five minutes...

HyperLogLog and Bloom Filter

I saw that Facebook added HyperLogLog-based counting to Presto, and it reminded me that I keep forgetting about these two data structures that trade accuracy for speed: "HyperLogLog in Presto: A significantly faster way to handle cardinality estimation".

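HyperLogLog (HLL) is a data structure for the count-distinct problem. With very little memory it gives an estimate with a small error (about 1.5KB for data on the order of a billion items, with an error of roughly 2%). Presto uses it to provide APPROX_DISTINCT() and other functions, and because the structure is so small the computation fits easily in L2/L3 cache, which speeds things up considerably.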

Depending upon the problem at hand, we can achieve speed improvements of anywhere from 7x to 1,000x.
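As a rough illustration of how HLL gets away with so little state, here is a minimal sketch in Python. It is not Presto's implementation: the class name, the choice of SHA-1 as the hash, and the default of 2^11 registers are my own assumptions. Each item is hashed, the first p bits pick a register, and the register keeps the longest run of leading zeros (plus one) seen in the remaining bits.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (hypothetical, for illustration)."""

    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")            # 64-bit hash value
        idx = x >> (64 - self.p)                         # first p bits: register
        rest = x & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for _ in range(2):                  # adding every ID twice must not change the count
    for i in range(100_000):
        hll.add(f"user-{i}")

estimate = hll.count()
print(f"estimate: {estimate:.0f}, relative error: {abs(estimate - 100_000) / 100_000:.2%}")
```

A Python list is of course far larger than the packed representation, but the logical state here is 2048 small counters, which is where a figure like 1.5KB comes from if each counter is packed into 6 bits.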

I've also mentioned before that Reddit uses HLL for its statistics: "Reddit 在處理 Page View 的方式" (how Reddit handles page views).

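A Bloom filter also deals with large amounts of data, but it does a different job: it answers whether an item may exist in a set (false positives are possible, false negatives are not). The space-versus-error trade-off is about 10 bits per key for a 1% false positive rate. I've also written about some variants that offer additional features, such as "Quotient filter" and "Cuckoo Filter：比 Bloom Filter 多了 Delete" (Cuckoo filter: a Bloom filter plus delete).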
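The "10 bits per key at 1%" figure follows from the standard sizing formula, bits per key = -ln(p) / (ln 2)^2, which gives about 9.6 for p = 0.01. Below is a minimal sketch of a Bloom filter sized that way. The class and helper names are hypothetical, and deriving the k bit positions from two halves of a SHA-256 digest (double hashing) is my own choice, not something from the posts referenced here.

```python
import hashlib
import math


def bits_per_key(fp_rate):
    """Optimal bits per key for a target false positive rate."""
    return -math.log(fp_rate) / (math.log(2) ** 2)


class BloomFilter:
    """Toy Bloom filter (hypothetical, for illustration)."""

    def __init__(self, capacity, fp_rate=0.01):
        self.m = math.ceil(capacity * bits_per_key(fp_rate))     # total bits
        self.k = max(1, round(self.m / capacity * math.log(2)))  # hash count
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(capacity=10_000)
for i in range(10_000):
    bf.add(f"key-{i}")

false_positives = sum(f"other-{i}" in bf for i in range(10_000))
print(f"bits per key: {bits_per_key(0.01):.2f}, hashes: {bf.k}")
print(f"observed false positive rate: {false_positives / 10_000:.2%}")
```

Note that there is no remove method: clearing a bit could also erase evidence of other keys that share it, which is the limitation the variants above try to lift.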

How Instagram solved Cassandra's performance problems

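Among the fixes for Cassandra's performance problems, ScyllaDB is probably the best known: a rewrite in C++ that improves performance dramatically. The Instagram team instead replaced the underlying data structures with RocksDB (this company really loves its own RocksDB...): "Open-sourcing a 10x reduction in Apache Cassandra tail latency".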

The main reason is that they found Cassandra's data-handling code runs into JVM GC problems, and that this is the main cause of Cassandra's poor performance:

Apache Cassandra is a distributed database with it’s own LSM tree-based storage engine written in Java. We found that the components in the storage engine, like memtable, compaction, read/write path, etc., created a lot of objects in the Java heap and generated a lot of overhead to JVM.

After the switch, their tests showed a large performance gain, along with a large drop in GC stalls:

In one of our production clusters, the P99 read latency dropped from 60ms to 20ms. We also observed that the GC stalls on that cluster dropped from 2.5% to 0.3%, which was a 10X reduction!

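Comparing the two: ScyllaDB rewrites everything in C++ (keeping the data structures), which gets rid of the JVM GC problem outright. Rocksandra profiles first and replaces only the hot spots (here apparently the data-handling code, swapped for RocksDB), and abstracts some interfaces along the way... Two different approaches, both of which solve the JVM GC problem.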

Quotient filter

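I previously wrote about "Cuckoo Filter：比 Bloom Filter 多了 Delete" (Cuckoo filter: a Bloom filter plus delete), and recently came across the Quotient filter in "A general purpose counting filter: making every bit count". It is a data structure similar to a Bloom filter, but it tries to solve more problems.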

An ordinary Bloom filter (BF) has these problems:

The inability to delete items
Poor scaling out of RAM
The inability to resize dynamically
The inability to count the number of occurrences of each item, especially with skewed input distributions.

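The Quotient filter (QF) described in the article is meant to address these. It also covers the Rank-and-Select-based Quotient filter (RSQF) and the Counting Quotient filter (CQF). They need somewhat more space, but they look like they fix quite a few problems... (especially the ability to delete)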
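To show why this family of filters can delete and count where a Bloom filter cannot, here is a heavily simplified sketch of the quotienting idea. A fingerprint is split into a quotient, which selects a slot, and a remainder, which is what gets stored. A real QF, RSQF or CQF packs remainders into a flat array with a few metadata bits per slot and resolves collisions by shifting entries; this toy version swaps that for a Python dict of counters, so it demonstrates the behaviour but none of the space efficiency. All names and bit widths are hypothetical.

```python
import hashlib
from collections import Counter


class ToyQuotientFilter:
    """Quotient/remainder split only; not the compact layout of a real QF."""

    def __init__(self, q_bits=12, r_bits=8):
        self.q_bits = q_bits
        self.r_bits = r_bits
        self.buckets = {}            # quotient -> Counter of remainders

    def _split(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        # Keep the top (q_bits + r_bits) bits of a 64-bit hash as the fingerprint.
        fp = int.from_bytes(digest[:8], "big") >> (64 - self.q_bits - self.r_bits)
        return fp >> self.r_bits, fp & ((1 << self.r_bits) - 1)

    def add(self, item):
        q, r = self._split(item)
        self.buckets.setdefault(q, Counter())[r] += 1

    def count(self, item):
        q, r = self._split(item)
        return self.buckets.get(q, {}).get(r, 0)

    def remove(self, item):
        """Remove one occurrence; returns False if the fingerprint is absent."""
        q, r = self._split(item)
        bucket = self.buckets.get(q)
        if not bucket or bucket.get(r, 0) == 0:
            return False
        bucket[r] -= 1
        if bucket[r] == 0:
            del bucket[r]
        if not bucket:
            del self.buckets[q]
        return True


qf = ToyQuotientFilter()
qf.add("apple")
qf.add("apple")
print(qf.count("apple"))     # occurrences recorded for this fingerprint
qf.remove("apple")
print(qf.count("apple"))
```

Since whole fingerprints are stored rather than shared bits, removing one occurrence does not disturb other entries, although two items with the same fingerprint remain indistinguishable, which is where the false positives come from.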

Performance also looks decent, read speed in particular... though I don't know how it compares to the Cuckoo filter.

GitHub organizations now support nested teams...

GitHub's organization management now supports hierarchy: "Nested teams add depth to your team structure".

If you're a member of Engineering and someone creates a child team called Security, team members of Engineering aren't automatically direct team members of Security. Security and all other teams nested under the Engineering will inherit repository permissions and @mentions but nothing else.

This includes the concept of permission inheritance.
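The rule in the quote can be modelled with a small tree, as in the sketch below. This is only my reading of the announcement, not GitHub's data model or API: the Team class, its fields, and the repository names are all hypothetical. Repository access flows down from parent to child, while direct membership stays where it was assigned.

```python
class Team:
    """Hypothetical model of a nested team, for illustration only."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.members = set()     # direct members only
        self.repos = set()       # repositories granted directly to this team

    def effective_repos(self):
        """Repositories granted to this team or inherited from any ancestor."""
        inherited = self.parent.effective_repos() if self.parent else set()
        return inherited | self.repos


engineering = Team("Engineering")
engineering.members.add("alice")
engineering.repos.add("backend")

security = Team("Security", parent=engineering)
security.members.add("bob")
security.repos.add("audit-tools")

print(sorted(security.effective_repos()))    # inherited plus direct grants
print("alice" in security.members)           # membership is not pushed down
```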

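I've been waiting a long time for this feature, and it happens that I'll need it soon... I used to have to hack this together by hand, and now it looks like it can be managed much more conveniently.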

How Reddit handles page views

Reddit explains how they handle pageviews: "View Counting at Reddit".

At Reddit's scale the post makes two key points. The first is making good use of Redis's HyperLogLog data structure, since at high volume a small error is acceptable:

The amount of memory varies per implementation, but in the case of this implementation, we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!

Wikipedia notes that with data on the order of 10⁹, 1.5KB of memory gives an error of only about 2%:

The HyperLogLog algorithm is able to estimate cardinalities of > 10⁹ with a typical error rate of 2%, using 1.5 kB of memory.
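Both memory figures line up with the usual HLL rule of thumb that the standard error is about 1.04 / sqrt(m) for m registers. The quick check below assumes 6-bit registers, which is what Redis uses for its 12KB structure; for the 1.5KB example that width is my assumption, and a different width would shift the result slightly. The function name is made up.

```python
import math


def hll_standard_error(memory_bytes, bits_per_register=6):
    """Approximate HLL standard error for a given memory budget.

    Assumes the memory is spent entirely on fixed-width registers.
    """
    registers = memory_bytes * 8 // bits_per_register
    return 1.04 / math.sqrt(registers)


print(f"1.5KB: {hll_standard_error(1536):.2%}")
print(f"12KB:  {hll_standard_error(12 * 1024):.2%}")
```

That works out to roughly 2.3% for 1.5KB and roughly 0.8% for 12KB, so the eightfold increase in memory buys a bit under a threefold reduction in error.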

The second is that writes tolerate a short delay (pageviews are not reflected in real time), with batching used to reduce the load on the Cassandra cluster:

Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster.
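A minimal sketch of that kind of per-post batching might look like the following. It is not Reddit's code: the class, the flush callback standing in for a Cassandra write, and the explicit `now` argument (used so the example is deterministic) are all assumptions. Views are buffered per post, and a post's buffer is flushed once its window has been open for 10 seconds.

```python
from collections import defaultdict


class BatchedViewWriter:
    """Buffers view counts per post and flushes each post every `window` seconds."""

    def __init__(self, flush, window_seconds=10):
        self.flush = flush                   # callback(post_id, count), e.g. a DB write
        self.window = window_seconds
        self.pending = defaultdict(int)      # post_id -> views buffered so far
        self.window_start = {}               # post_id -> time the current batch opened

    def record(self, post_id, now):
        self.window_start.setdefault(post_id, now)
        self.pending[post_id] += 1

    def tick(self, now):
        """Flush every post whose batch has been open for at least one window."""
        for post_id, started in list(self.window_start.items()):
            if now - started >= self.window:
                self.flush(post_id, self.pending.pop(post_id))
                del self.window_start[post_id]


writes = []
writer = BatchedViewWriter(flush=lambda post_id, count: writes.append((post_id, count)))

for t in (0.0, 1.5, 3.0):
    writer.record("post-1", now=t)
writer.tick(now=5.0)      # window still open, nothing written
writer.tick(now=10.0)     # one write carrying all three views
print(writes)
```

Three views turn into a single write, at the cost of the stored count lagging by up to one window.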

Note that Redis is used as a cache layer rather than a storage layer.

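The main reason is probably that Redis positions itself as a data structure server rather than data structure storage (as you can see from its approach to durability), while storing key-value data in Cassandra scales very easily but reads are slow. The two complement each other nicely.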

Data structure and algorithm questions in interviews

At "500 Data structures and algorithms interview questions and their solutions" I found a set of questions (with solutions) compiled on Quora.

Each question also has a place for comments below it, so the content should get richer over time?

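Another nice thing is that the questions are categorized: for example, if you want to practice backtracking, you can pull up the questions in that category.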