Google Compute Engine 推出 Custom Machine Type

Google Compute Engine 推出了可以自己設定 CPU 與 RAM 的機器種類:「Custom Machine Types - Compute Engine — Google Cloud Platform」。

可以從 1 個 vCPU 到 32 個 vCPU,而記憶體最多是 6.5GB * vCPU 數,所以理論上最高是 208GB?

Create a machine type with as little as 1 vCPU and up to 32 vCPUs, or any even number of vCPUs in between. Memory can be configured up to 6.5 GB of RAM per vCPU.

計價方式就是 vCPU 算一份,記憶體算一份。記得以前有比較小的 Cloud Service 有提供過類似的計價方式,後來都收掉了...

MySQL 上 InnoDB 與 RocksDB 的差異

Mark Callaghan 在「MyRocks vs InnoDB with Linkbench over 7 days」這邊分析了 MySQL 上的兩個 engine 的差異,總結是大獲全勝:

  • InnoDB writes between 8X and 14X more data to SSD per transaction than RocksDB
  • RocksDB sustains about 1.5X more QPS
  • Compressed/uncompressed InnoDB uses 2X/3X more SSD space than RocksDB

但 InnoDB 的 engine 是 2000 年的設計 (16 年前),MyRocks 的 engine (RocksDB) 則是 2013 年的設計,不屬於同一個世代。

相較於與 InnoDB 對打,我更想看到的是與 TokuDB (2012 年的設計) 對打的結果。

兩個新的 engine 都有針對 SSD 的特性發展,可以看出資料結構與寫入的方式就很不一樣,而且一開始就是在多 CPU 多核環境下開發,相較於 InnoDB 是一路改的包袱來的輕鬆許多。

Facebook 更新 iOS 應用程式,修正吃電問題

在「在 iOS 上不使用 Facebook App 時要完全砍掉 process」這邊提到了 Facebook 在 iOS 版的應用程式會在背景播放無聲音樂,導致吃電特別兇的問題,Facebook 的 Ari Grant 出來澄清是 bug 造成的,而非故意行為。

修正了兩個 bug,第一個是 network code 的部分:

The first issue we found was a “CPU spin” in our network code. A CPU spin is like a child in a car asking, “Are we there yet? Are we there yet? Are we there yet?”with the question not resulting in any progress to reaching the destination. This repeated processing causes our app to use more battery than intended. The version released today has some improvements that should start making this better.

第二個則是之前提到無聲 audio 的問題:

The second issue is with how we manage audio sessions. If you leave the Facebook app after watching a video, the audio session sometimes stays open as if the app was playing audio silently. This is similar to when you close a music app and want to keep listening to the music while you do other things, except in this case it was unintentional and nothing kept playing. The app isn't actually doing anything while awake in the background, but it does use more battery simply by being awake. Our fixes will solve this audio issue and remove background audio completely.

同時澄清並沒有要在背景更新取得地理位置資訊:

The issues we have found are not caused by the optional Location History feature in the Facebook app or anything related to location. If you haven't opted into this feature by setting Location Access to Always and enabling Location History inside the app, then we aren't accessing your device's location in the background. The issues described above don't change this at all.

理論上新版應該會省一點電了?

Percona 對 mysql_query_cache 的測試 (以 Magento 為例)

Percona 的人以現在的觀點來看 mysql_query_cache:「The MySQL query cache: Worst enemy or best friend?」。

起因主要也是懷疑 query cache 是 global mutex 在現在的硬體架構 (主要是 CPU 數量成長) 應該是個負面的影響,但不確定影響多少:

The query cache is well known for its contentions: a global mutex has to be acquired for any read or write operation, which means that any access is serialized. This was not an issue 15 years ago, but with today’s multi-core servers, such serialization is the best way to kill performance.

這邊就有點怪了,PK search 應該是個位數 ms 等級才對 (一般 EC 網站的資料量都應該可以 memory fit),不知道他是怎麼測的:

However from a performance point of view, any query cache hit is served in a few tens of microseconds while the fastest access with InnoDB (primary lookup) still requires several hundreds of microseconds. Yes, the query cache is at least an order of magnitude faster than any query that goes to InnoDB.

anyway,他實際測試兩個不同的 configuration,首先是打開 query cache 的:

再來是關閉 query cache 的:

測試的方式在原文有提到,這邊就不抄過來了。測試的結果可以看到關閉 query cache 得到比較好的 thoughput:

Throughput scales well up to somewhere between 10 and 20 threads (for the record the server I was using had 16 cores). But more importantly, even at the higher concurrencies, the overall throughput continued to increase: at 20 concurrent threads, MySQL was able to serve nearly 3x more queries without the query cache.

跟預期的差不多,硬體的改變使得演算法也必須跟著改,不然就會遇到問題...

CloudFlare 加速 zlib library

CloudFlare 大幅改善 zlib,使得速度相較原來的版本快了許多:「Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code」。

改變的部份主要是 CPU 指令集的特性,以及 longest-match function 的改善。可以看出不同的測試樣本 (corpus) 在壓縮率沒有變差的情況下,大幅改善了速度。

Calgary corpus

Canterbury corpus

Large corpus

Silesia corpus

速度快非常多,跟 Google 以壓縮率為導向而放出來的 zopfli 剛好是兩個極端:「Google 發表與 zlib/deflate 相容的壓縮程式,再小 5%...」。

Flickr 用 GPU 產生縮圖

Flickr 說明了用 GPU 產生縮圖的技術:「Real-time Resizing of Flickr Images Using GPUs」。

這邊可以看到是透過 apache module 再轉到 resize worker:

成果就是讓縮圖的產生時間縮短很多,以 2048px 轉到 1600px 為例,速度快了 15 倍,即使是與 Yahoo 自己最佳化過的版本相比,也快了 10 倍:

而 GPU-based 的架構每台 server 可以支撐 300 resizes/sec:

Equally noteworthy, at peak load, each resize server can perform over 300 resizes per second.

支援多 CPU 的 ab:wrk

在「wrk」這邊看到 wrk 這個工具:「Modern HTTP benchmarking tool」。

利用 multi-threading 與 epoll/kqueue 撐出效能:

wrk is a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It combines a multithreaded design with scalable event notification systems such as epoll and kqueue.

在瀏覽器上面用 JavaScript 進行 Side-channel attack

用 JavaScript 就可以攻擊 L3 cache,進而取得資料:「JavaScript CPU cache snooper tells crooks EVERYTHING you do online」。

論文出自「The Spy in the Sandbox – Practical Cache Attacks in Javascript」(PDF) 這篇。

不需要任何外掛或 exploit,就純粹是利用 cache 反應時間的 side-channel attack。另外由於 AMD 的 cache 架構不同,這次的攻擊實作僅對 Intel 有效:

The Intel cache micro-architecture isinclusive– all elements in the L1 cache must also exist in the L2 and L3 caches. Conversely, if a memory element is evicted fromthe L3 cache, it is also immediately evicted from the L2 and L1 cache. It should be noted that the AMD cachemicro-architecture is exclusive, and thus the attacks described in this report are not immediately applicable tothat platform.

這次的攻擊方法真變態...

用 Go 寫的 Tor Relay Server

Zite 上看到的「Implementing a Tor relay from scratch」,用 Go 寫的 Tor Relay Server。

會跳下去用 Go 寫是因為效能上的考量:

[...], but the lack of AES-NI instructions on the CPUs cause a significant slowdown.

但因為一個 IP 只能跑兩個 instance,這就有點痛了:

To maximize the amount of relayed data, it is normal to simply run multiple instances of the program, up to two per IP address.

而作者的目標是超過現有的極限:

My final goal was to beat the Tor speed record, which was at roughly 200 megabytes per second.

成果就是直接吃滿 2Gbps (250MBytes/sec),而且 CPU 只用了 60%:

[...] He set up a server with my relay and within a few days we had broken the Tor speed record with a nice 250 megabytes per second, effectively maxing out the network link. CPU usage was at a nice 60% across 12 cores. But his relay also suffered from the memory issues and had to be restarted every few days.

作者的程式碼放在 GitHub 上,之後應該會有人跳進去改寫:「A fast implementation of the Tor OR code, in Go」。

Amazon EC2 的 C4 Instance

這時候就要拿出經典台詞:

Anyway,Amazon EC2 推出 C4 Instance 了:「Now Available – New C4 Instances」。

來看看 c4.8xlarge,最近說不定用的上,不過要注意的是,C4 Instance 只支援 EBS only。

c4.8xlarge 的 ECU 是 132 (USD$1.856/hour) 而 c3.8xlarge 是 108 (USD$1.68/hour),跟 C3 相比,看起來更實惠一些。