Picking an EC2 Instance Type for Databases

ScyllaDB is a C++-compatible rewrite of Cassandra, with considerably better performance than the Java version (especially in the CPU- and memory-related parts).

Last month the ScyllaDB folks published a guide on how to choose an instance type on Amazon EC2 for running NoSQL (the analysis is mostly for the ScyllaDB case, but the reasoning carries over): "Choosing EC2 instances for NoSQL".

Unlike Cassandra, which tends to be CPU bound, ScyllaDB tends to be I/O bound, so I/O performance matters a lot more when choosing an instance type.

The article also touches on instance size (eight xlarge machines versus one 8xlarge), though it doesn't give a very clear direction. In general, communication between nodes of a distributed database still carries a fair bit of overhead, and the article also mentions I/O noise from neighbors on the same physical host, so it looks like once you're operating at sufficient scale, going straight to the biggest instance is the way to go?

Google Releases Implementations of Three Hash Algorithms

Google has released implementations of three hash algorithms: "New algorithms may lower the cost of secure computing".

The first is an accelerated implementation of SipHash, sped up with the AVX-2 instruction set; per Wikipedia, AVX2 has been available on Intel CPUs since Haswell (2013) and on AMD CPUs since Excavator (2015):

Our first hash function produces the same output as SipHash, but 1.5 times as quickly thanks to AVX-2 instructions.

The second is an improved variant of SipHash that produces different output (so it isn't SipHash), but runs even faster:

The second improvement uses j-lanes tree hashing to process multiple inputs in parallel, which is 3 times as fast. This technique is known to be secure, but produces different output than the original SipHash and is slightly slower for short inputs.
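The j-lanes idea is to split the input into several interleaved lanes, hash each lane independently (which is what lets the SIMD version process them in parallel), and then hash the concatenation of the lane digests. Here's a minimal Python sketch of that structure only, using keyed blake2b as a stand-in PRF since SipHash isn't in the standard library; the lane count and key are arbitrary choices for illustration, not the parameters Google uses:

```python
import hashlib

KEY = b"0123456789abcdef"  # arbitrary 16-byte key for the keyed hash

def lane_hash(data: bytes) -> bytes:
    # Stand-in keyed PRF; the real implementation uses SipHash with AVX-2 code.
    return hashlib.blake2b(data, key=KEY, digest_size=8).digest()

def j_lanes_hash(message: bytes, j: int = 4) -> bytes:
    # Split the message into j interleaved lanes: lane i gets bytes i, i+j, i+2j, ...
    lanes = [message[i::j] for i in range(j)]
    # Each lane can be hashed independently (in parallel in the SIMD version).
    digests = [lane_hash(lane) for lane in lanes]
    # Combine the per-lane digests with one final hash.
    return lane_hash(b"".join(digests))

print(j_lanes_hash(b"hello world, this is a j-lanes tree hashing sketch").hex())
```

As the quote notes, the output necessarily differs from plain SipHash, since the tree structure changes what gets hashed.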

The third is a brand-new hash, faster still than the other two, but it needs more cryptanalysis before its security can be confirmed:

HighwayHash is based on a new way of mixing inputs with just a few AVX-2 multiply and permute instructions. We are hopeful that the result is a cryptographically strong pseudorandom function, but new cryptanalysis methods might be needed for analyzing this promising family of hash functions. HighwayHash is significantly faster than SipHash for all measured input sizes, with about 7 times higher throughput at 1 KiB.

The code for all three can be found in "google/highwayhash" on GitHub; the LICENSE file says it's Apache License 2.0.

Turning Doodles into Oil Paintings with Deep Neural Networks...

I came across the "Neural Doodle" project, which can turn doodles into images with oil-painting brushwork:

It looks like an implementation of some fairly new algorithms:

Use a deep neural network to borrow the skills of real artists and turn your two-bit doodles into masterpieces! This project is an implementation of Semantic Style Transfer (Champandard, 2016), based on the Neural Patches algorithm (Li, 2016).

Another set of examples:

The program can run on pure CPU or on GPU; either way it eats a lot of memory XDDD

Google Compute Engine Launches Custom Machine Types

Google Compute Engine has launched machine types where you can configure the CPU and RAM yourself: "Custom Machine Types - Compute Engine — Google Cloud Platform".

You can go from 1 vCPU up to 32 vCPUs, and memory tops out at 6.5 GB per vCPU, so in theory the maximum is 208 GB?

Create a machine type with as little as 1 vCPU and up to 32 vCPUs, or any even number of vCPUs in between. Memory can be configured up to 6.5 GB of RAM per vCPU.
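To make the arithmetic explicit, here's a tiny sketch that encodes the limits from the quote above (1 vCPU, or any even count up to 32, at up to 6.5 GB per vCPU):

```python
def max_memory_gb(vcpus):
    # Custom machine types allow 1 vCPU, or any even count from 2 to 32.
    if vcpus != 1 and (vcpus % 2 != 0 or not 2 <= vcpus <= 32):
        raise ValueError("vCPU count must be 1, or an even number between 2 and 32")
    return 6.5 * vcpus  # up to 6.5 GB of RAM per vCPU

print(max_memory_gb(32))  # 208.0 -- the theoretical maximum mentioned above
```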

Pricing is simply one charge for the vCPUs and one for the memory. I remember some smaller cloud services offered similar pricing in the past, but they all shut down...

The Differences Between InnoDB and RocksDB on MySQL

In "MyRocks vs InnoDB with Linkbench over 7 days", Mark Callaghan analyzes the differences between the two MySQL engines, and the summary is a landslide win for MyRocks:

  • InnoDB writes between 8X and 14X more data to SSD per transaction than RocksDB
  • RocksDB sustains about 1.5X more QPS
  • Compressed/uncompressed InnoDB uses 2X/3X more SSD space than RocksDB

That said, the InnoDB engine is a design from 2000 (16 years ago), while MyRocks' engine (RocksDB) is a 2013 design; they're not from the same generation.

Rather than a matchup against InnoDB, what I'd really like to see is a matchup against TokuDB (a 2012 design).

Both of the newer engines were developed around SSD characteristics, and you can see that their data structures and write patterns are quite different; they were also built for multi-CPU, multi-core environments from the start, so they carry far less baggage than InnoDB, which has been patched up incrementally over the years.
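As a rough illustration of why the write patterns differ: RocksDB is an LSM-tree, so writes land in an in-memory memtable and are later flushed as sequential, append-only sorted runs, whereas a B-tree engine like InnoDB rewrites pages in place. A toy sketch of the LSM side (just the shape of the write path, nothing like RocksDB's actual code):

```python
class ToyLSM:
    """Toy LSM-tree write path: an in-memory memtable plus append-only sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}        # recent writes, held in memory
        self.sstables = []        # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory; nothing on "disk" is rewritten in place.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A flush is one sequential append of a sorted run -- SSD friendly.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads check the memtable first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None


db = ToyLSM()
for i in range(10):
    db.put("user:%d" % i, "row-%d" % i)
print(db.get("user:3"), len(db.sstables))  # -> row-3 2
```

The append-only flushes are what keep the per-transaction write volume (and SSD wear) low compared with rewriting B-tree pages in place.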

Facebook Updates Its iOS App to Fix the Battery Drain Problem

In "Kill the Facebook App's Process Completely When You're Not Using It on iOS" I mentioned that Facebook's iOS app plays silent audio in the background, causing unusually heavy battery drain; Facebook's Ari Grant has now come out to clarify that this was caused by bugs, not intentional behavior.

Two bugs were fixed. The first is in the network code:

The first issue we found was a “CPU spin” in our network code. A CPU spin is like a child in a car asking, “Are we there yet? Are we there yet? Are we there yet?” with the question not resulting in any progress to reaching the destination. This repeated processing causes our app to use more battery than intended. The version released today has some improvements that should start making this better.

The second is the silent-audio issue mentioned earlier:

The second issue is with how we manage audio sessions. If you leave the Facebook app after watching a video, the audio session sometimes stays open as if the app was playing audio silently. This is similar to when you close a music app and want to keep listening to the music while you do other things, except in this case it was unintentional and nothing kept playing. The app isn't actually doing anything while awake in the background, but it does use more battery simply by being awake. Our fixes will solve this audio issue and remove background audio completely.

He also clarified that the app is not fetching location data in the background:

The issues we have found are not caused by the optional Location History feature in the Facebook app or anything related to location. If you haven't opted into this feature by setting Location Access to Always and enabling Location History inside the app, then we aren't accessing your device's location in the background. The issues described above don't change this at all.

In theory the new version should save a bit of battery?

Percona Tests the MySQL Query Cache (Using Magento as an Example)

The folks at Percona take a present-day look at the MySQL query cache: "The MySQL query cache: Worst enemy or best friend?".

The starting point is the suspicion that the query cache's global mutex should be a net negative on modern hardware (mainly because of the growth in CPU core counts), without being sure how big the impact is:

The query cache is well known for its contentions: a global mutex has to be acquired for any read or write operation, which means that any access is serialized. This was not an issue 15 years ago, but with today’s multi-core servers, such serialization is the best way to kill performance.
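A minimal sketch of the structure being described: every lookup and every invalidation goes through one process-wide lock, so even pure reads serialize across all cores (toy Python, not MySQL's implementation):

```python
import threading

class ToyQueryCache:
    """Toy model of MySQL's query cache: one global mutex guards everything."""

    def __init__(self):
        self._lock = threading.Lock()   # the single global mutex
        self._cache = {}                # SQL text -> cached result rows

    def get(self, sql):
        # Even a cache *hit* has to take the global lock, so concurrent
        # readers on a many-core box end up waiting on each other.
        with self._lock:
            return self._cache.get(sql)

    def put(self, sql, rows):
        with self._lock:
            self._cache[sql] = rows

    def invalidate(self):
        # Any write to an underlying table invalidates entries under the same lock.
        with self._lock:
            self._cache.clear()
```

Turning the feature off entirely, which is what the second configuration below corresponds to, is typically done on a real server with query_cache_type = 0 and query_cache_size = 0.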

This part is a bit odd, though: a primary-key lookup should be on the order of single-digit milliseconds (a typical e-commerce site's data set should fit in memory), so I'm not sure how he measured this:

However from a performance point of view, any query cache hit is served in a few tens of microseconds while the fastest access with InnoDB (primary lookup) still requires several hundreds of microseconds. Yes, the query cache is at least an order of magnitude faster than any query that goes to InnoDB.

Anyway, he actually tested two different configurations. First, with the query cache enabled:

Then with the query cache disabled:

The test methodology is described in the original article, so I won't copy it here. From the results you can see that disabling the query cache yields better throughput:

Throughput scales well up to somewhere between 10 and 20 threads (for the record the server I was using had 16 cores). But more importantly, even at the higher concurrencies, the overall throughput continued to increase: at 20 concurrent threads, MySQL was able to serve nearly 3x more queries without the query cache.

Pretty much as expected: as the hardware changes, the algorithms have to change with it, or you run into problems...

CloudFlare Speeds Up the zlib Library

CloudFlare has substantially improved zlib, making it much faster than the original version: "Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code".

The changes mainly exploit CPU instruction-set features, plus an improved longest-match function. Across the different test samples (corpora) you can see large speed gains with no loss in compression ratio.

[Benchmark charts for the Calgary, Canterbury, Large, and Silesia corpora]

The speedup is huge, which puts it at the opposite extreme from Google's zopfli, which was released with compression ratio as the goal: "Google Releases a zlib/deflate-Compatible Compressor That Shaves Off Another 5%...".
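If you want a rough feel for this speed-versus-ratio trade-off yourself, here's a minimal benchmark sketch using Python's zlib binding (it measures whatever zlib build the interpreter was linked against, so it won't reproduce CloudFlare's numbers; the corpus path is just a placeholder):

```python
import time
import zlib

def bench(path, level=6, rounds=5):
    with open(path, "rb") as f:
        data = f.read()
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        best = min(best, time.perf_counter() - start)
    ratio = len(compressed) / len(data)
    mb_per_s = len(data) / best / 1e6
    print("level %d: ratio %.3f, %.1f MB/s" % (level, ratio, mb_per_s))

# Any corpus file works here; "calgary/book1" is just a placeholder path.
bench("calgary/book1")
```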

Flickr Generates Thumbnails with GPUs

Flickr describes the technology behind generating thumbnails with GPUs: "Real-time Resizing of Flickr Images Using GPUs".

Here you can see that requests go through an Apache module and are then handed off to a resize worker:

The result is much shorter thumbnail generation times: resizing from 2048px to 1600px, for example, is 15 times faster, and even compared with Yahoo's own optimized version it's 10 times faster:

And with the GPU-based architecture, each server can sustain 300 resizes per second:

Equally noteworthy, at peak load, each resize server can perform over 300 resizes per second.
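Flickr's GPU pipeline isn't public, so purely as a point of reference for what the 2048px → 1600px operation itself looks like, here's a minimal CPU-only sketch with Pillow (synthetic image, illustrative only, not their code):

```python
import time

from PIL import Image  # Pillow; a CPU-only stand-in, not Flickr's GPU pipeline

# Synthetic 2048px-wide image standing in for an original photo.
original = Image.new("RGB", (2048, 1365), color=(120, 160, 200))

start = time.perf_counter()
resized = original.resize((1600, 1066), Image.LANCZOS)
elapsed_ms = (time.perf_counter() - start) * 1000
print("2048px -> 1600px on CPU: %.1f ms, size %s" % (elapsed_ms, resized.size))
```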

An ab with Multi-CPU Support: wrk

Came across the wrk tool over at "wrk": a "Modern HTTP benchmarking tool".

It gets its performance from multi-threading plus epoll/kqueue:

wrk is a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It combines a multithreaded design with scalable event notification systems such as epoll and kqueue.
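The "scalable event notification" part maps to epoll on Linux and kqueue on BSD/macOS. Python's selectors module picks the same primitives automatically, which gives a tiny way to see the building block wrk is built on (this is just readiness notification, not a benchmarking tool):

```python
import selectors
import socket

# DefaultSelector picks the best backend for the platform:
# EpollSelector on Linux, KqueueSelector on BSD/macOS.
sel = selectors.DefaultSelector()
print("backend:", type(sel).__name__)

# A connected socket pair stands in for client connections to a server.
a, b = socket.socketpair()
a.setblocking(False)
sel.register(a, selectors.EVENT_READ)

b.send(b"hello")  # make `a` readable
for key, events in sel.select(timeout=1.0):
    print("readable:", key.fileobj.recv(16))

sel.unregister(a)
a.close()
b.close()
```

wrk runs one such event loop per thread, each driving many non-blocking connections, which is how a single multi-core machine can generate significant load.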