Instagram 解決 Cassandra 效能問題的方法

在解決 Cassandra 效能問題中大概就 ScyllaDB 特別有名,用 C++ 重寫一次使得效能大幅改善。而 Instagram 的人則是把底層的資料結構換掉,改用 RocksDB (這公司真的很愛自家的 RocksDB...):「Open-sourcing a 10x reduction in Apache Cassandra tail latency」。

主要原因是他們發現 Cassandra 在處理資料的部份會有 JVM 的 GC 問題,而且是導致 Cassandra 效能差的主要原因:

Apache Cassandra is a distributed database with it’s own LSM tree-based storage engine written in Java. We found that the components in the storage engine, like memtable, compaction, read/write path, etc., created a lot of objects in the Java heap and generated a lot of overhead to JVM.

然後在換完後測試可以看到效能大幅提昇,也可以看到 GC 的延遲大幅降低:

In one of our production clusters, the P99 read latency dropped from 60ms to 20ms. We also observed that the GC stalls on that cluster dropped from 2.5% to 0.3%, which was a 10X reduction!

比較一下這兩者的差異:在 ScyllaDB 是全部都用 C++ 改寫 (資料結構不換),這樣就直接解決掉 JVM 的 GC 問題。在 Rocksandra 則是在 profiling 後挑重點換掉 (這邊看起來是處理資料的 code,直接換成 RocksDB),另外順便把一些界面抽象化... 兩個不一樣的解法,都解決了 JVM 的 GC 問題。

Netflix 在 Time Series Data Storage 上的努力...

在「Scaling Time Series Data Storage — Part I」這篇看到 Netflix 在 Time Series Data Storage 上所做的努力...

因為應用在寫多讀少的場景,所以選擇使用 Cassandra,遇到瓶頸後把常寫入的與不太會改變的拆開儲存,並且用不同方式最佳化。包括了 cache 與 compression 都拿出來用了...

不知道他們內部有沒有評估 ScyllaDB 的想法...

從 Cassandra 到 ScyllaDB 的轉移方式好像跟以前不太一樣了...

在「New Docs: Four Phases to Migrate from Apache Cassandra to Scylla」這邊看到 ScyllaDB 官方提供 Cassandra 轉移到 ScyllaDB 的說明,跟以前好像差蠻多的...

以前 ScyllaDB 可以直接加入到 Cassandra 的 cluster (一時間沒找到資料,但在「can not add node with cassandra ami · Issue #107 · scylladb/scylla-cluster-tests」可以看到當時的痕跡),現在給的方法是在資料庫不相容時的轉移方式 (像是從 MySQL 轉換到 PostgreSQL 這種),是暗示已經沒辦法這樣做了嗎?

不過從 GitHub 上的 wiki page 看起來,底層資料與 protocol 應該還是相容的,才能做直接複製資料的 offline migration:「Migrating Cassandra data to Scylla」。

也有可能這篇只是寫手隨意寫的文章,沒有把 ScyllaDB 的優勢展現出來...

About John Hammink
John Hammink is a writer and content creator at ScyllaDB. With more than 20 years in technology, he's also a touring/studio musician, digital artist and speaker.

ScyllaDB 2.0 要引入 Cassandra 3.0+ 的 Materialized View

最近 ScyllaDB 的網站改版了... (有種不習慣的感覺)

ScyllaDB 2.0 打算要引入 Materialized View (出自 Apache Cassandra 3.0+):「Materialized Views preview in Scylla 2.0」。

一般 Materialized View 的實做方式是另外存一份,所以你可以在上面加 Index 之類的設定讓存取速度變快...

不過 Cassandra 不是本來就以讀慢寫快為優勢嗎,要速度可以考慮用 cache 疊出來,或是其他方式,當初 Cassandra 會開發這個功能就有點... XDDD

Apache Foundation 宣佈禁止使用 Facebook BSD+Patents 的軟體

在「RocksDB Integrations」這邊討論到 RocksDBFacebook 所使用的 Facebook BSD+Patents License。

不過因為 RocksDB 最近在換 license (從 Facebook BSD+Patents 換到 Apache License, Version 2.0),移除了 PATENTS 內的限制,需要看 PATENTS 的舊檔案可以在 PATENTS 這邊看到。

Chris Mattmann 正式發出決議禁用 Facebook BSD+Patents License。(參考最後)

另外也提到了 Facebook 是故意埋下這些限制:

Note also Roy's comment that he has discussed the matter with FB's counsel and the word is that the FB license is intentionally incompatible. It is hard to make the argument that it is compatible after hearing that. Pragmatically speaking, regardless of any semantic shaving being done, having a statement like that from the source of the license is very daunting. If they think it is incompatible, we need to not try to wheedle and convince ourselves it is not.

這個 license 之後應該會有更多挑戰...


As some of you may know, recently the Facebook BSD+patents license has been
moved to Category X (
Please see LEGAL-303 [1] for a discussion of this. The license is also referred
to as the ROCKSDB license, even though Facebook BSD+patents is its more
industry standard name.

This has impacted some projects, to date based on LEGAL-303
and the detective work of Todd Lipcon:

Samza, Flink, Marmotta, Kafka and Bahir

(perhaps more)

Please take notice of the following policy:

o No new project, sub-project or codebase, which has not
  used Facebook BSD+patents licensed jars (or similar), are allowed to use
  them. In other words, if you haven't been using them, you
  aren't allowed to start. It is Cat-X.

o If you have been using it, and have done so in a *release*,
  you have a temporary exclusion from the Cat-X classification thru
  August 31, 2017. At that point in time, ANY and ALL usage
  of these Facebook BSD+patents licensed artifacts are DISALLOWED. You must
  either find a suitably licensed replacement, or do without.
  There will be NO exceptions.

o Any situation not covered by the above is an implicit
  DISALLOWAL of usage.

Also please note that in the 2nd situation (where a temporary
exclusion has been granted), you MUST ensure that NOTICE explicitly
notifies the end-user that a Facebook BSD+patents licensed artifact exists. They
may not be aware of it up to now, and that MUST be addressed.

If there are any questions, please ask on the legal-discuss@a.o


Chris Mattmann
VP Legal Affairs


Reddit 在處理 Page View 的方式

Reddit 說明了他們如何處理 pageview:「View Counting at Reddit」。

以 Reddit 的規模有提到兩個重點,第一個在善用 RedisHyperLogLog 這個資料結構,當量大的時候其實可以允許有微小的誤差:

The amount of memory varies per implementation, but in the case of this implementation, we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!

維基百科上有說明當資料量在 109 這個等級時,用 1.5KB 的記憶體只有 2% 的誤差值:

The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical error rate of 2%, using 1.5 kB of memory.

第二個則是寫入允許短時間的誤差 (pageview 不會即時反應),透過批次處理降低對 Cassandra cluster 的負荷:

Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster.

可以注意到把 Redis 當作 cache 層而非 storage 層。

主要原因應該跟 Redis 定位是 data structure server 而非 data structure storage 有關 (可以從對 Durability 的作法看出來),而使用 Cassandra 存 key-value 非常容易 scale,但讀取很慢。剛好兩個相輔相成。

ScyllaDB 1.7 支援 Counters 了

在「Scylla release: version 1.7」這邊看到 ScyllaDB 支援 Counters 的消息了 (雖然剛出來,掛著 Experimental 的消息):

Scylla now supports Counters as a native type. A counter column is a column whose value is a 64-bit signed integer and on which two operations are supported: incrementing and decrementing.

這其實是 Cassandra 其中一個強項,針對 counter 這種應用特化的資料型態。

ScyllaDB 1.7 要支援 Cassandra 的 Counter 了

Counter 是 Cassandra 裡一個特別的資料型態,總算是要 porting 進 ScyllaDB 了:「Scylla 1.7 introduces experimental support for counters」。


  • once deleted, counter column value cannot be used again—this is a consequence of the fact that counters can only be incremented or decremented, they cannot be set to any specific value
  • a table cannot contain both counter and regular columns—without this limitation it wouldn’t be possible to provide proper atomicity and isolation guarantees for updates that modify both counters and regular columns in a single row
  • counter columns cannot be members of primary key
  • updates are not idempotent – in case of a write failure the client cannot safely retry the request
    counter columns cannot have time-to-live set

跟 Cassandra 一樣是透過 CRDT 實做的,行為上會是 eventually consistency:

Scylla implements counters using state-based conflict-free replicated data types (CRDT), which allow atomic operations, like increment or decrement, to be performed locally without the need for synchronization between nodes.

在「Scylla Status and Roadmap」這邊可以看到其他會在 1.7 出現的功能。

Scylla 1.4 系列的發佈

ScyllaDB 最近發行了 Scylla 1.4 與 1.4.1:「Scylla release: version 1.4」與「Scylla release: version 1.4.1」,另外也整理出 Docker 版本:「Scylla on Docker」。

可以從 1.4 的公告裡面看到功能愈來愈完整了,在導入其他跟 Cassandra 配合的軟體應該會愈來愈順,而且就之前用 Presto 而去 Scylla 的 GitHub 上回報問題的經驗,Scylla 的人對於能夠生出可重製的 bug report 還蠻重視的,解決速度都還算合理...

另外提供 Docker image 也讓想要測試的人變方便...

Scylla 1.3

看到 Scylla 正式公告 1.3 版的消息了:「Scylla release: version 1.3」。

Scylla 是用 C++ 重寫 Java 版本的 Cassandra 所有東西 (包括資料結構與 Protocol),目標是做到可以完全相容替換現有 Cassandra Cluster。(號稱可以一台一台移除 Cassandra 的程式,裝上 Scylla 後就可以無痛換過去)

而 Scylla 另外一個重點是效能的提昇,官方宣稱在完整最佳化的情況下是 10x 以上的效能提昇,之前拿 AWS 實測 (沒有完整最佳化) 也可以看到 2x 到 4x 的數字,對於目前的 Cassandra 應用來說極為重要。

1.3 版最重要的功能就是對 Thrift 的支援:

Thrift support. Many Cassandra users are still using Thrift, and they can now continue doing so while benefiting from Scylla’s performance. Built on top of Scylla CQL internal implementation, Scylla Thrift provides similar throughput and latency to Scylla CQL. Users of projects like KairosDB and Titan can now migrate to Scylla while maintaining full protocol compatibility .

本來在 roadmap 上的計畫是用兩個版本支援 Thrift:(從 Google Cache 拉出來的,CSS 看起來有些問題,不過意思有到就好)

剛剛發現 1.4 的 roadmap 已經沒有列 Thrift 了:

這應該是暗示已經實作完了?透過 Thrift 界面跟 Cassandra 溝通的應用程式都可以使用 Scylla 了...

先前在「Facebook Presto · Issue #1139 · scylladb/scylla」這邊跟 ScyllaDB 的人花了不少時間,總算是給出一份 data set 可以讓他們重製 bug,也算是有代價了 XD