optimization – Page 2 – Gea-Suan Lin's BLOG

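Changes in V8 version 6.5 (Chrome 65)

V8 version 6.5 brings quite a few changes: "V8 release v6.5".

Because of Spectre, the new V8 adds an untrusted code mode for running code you don't trust, with mitigations built in. It will also be enabled by default in the new Chrome: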

In response to the latest speculative side-channel attack called Spectre, V8 introduced an untrusted code mode. If you embed V8, consider leveraging this mode in case your application processes user-generated, not-trustworthy code. Please note that the mode is enabled by default, including in Chrome.

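The other change is streaming compilation for WebAssembly: the module is compiled while it is still downloading, which speeds things up considerably. The original post tests this with a fairly large WebAssembly module: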

For the graph below we measure the time it takes to download and compile a WebAssembly module with 67 MB and about 190,000 functions. We do the measurements with 25 Mbit/sec, 50 Mbit/sec, and 100 Mbit/sec download speed.

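The graph in the original post shows that for users on slower connections, compilation simply keeps up with the download, so the browser gets useful work done while the module is still downloading.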
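A rough way to see why this helps: without streaming, the total time is download plus compile; with streaming the two overlap, so the total approaches whichever is slower. The sketch below uses the 67 MB module size and the three link speeds from the quote. The 10-second compile time is a made-up figure for illustration only, and the model ignores startup latency and any compile work left over after the last byte arrives.

```python
MODULE_MB = 67  # module size quoted in the V8 post

def download_seconds(size_mb: float, mbit_per_sec: float) -> float:
    """Time to download size_mb megabytes over a link of mbit_per_sec."""
    return size_mb * 8 / mbit_per_sec

def sequential_total(download_s: float, compile_s: float) -> float:
    """Download everything first, then compile."""
    return download_s + compile_s

def streaming_total(download_s: float, compile_s: float) -> float:
    """Idealized overlap: compile while downloading, finish with the slower one."""
    return max(download_s, compile_s)

ASSUMED_COMPILE_S = 10.0  # hypothetical compile time, NOT from the V8 post

for speed in (25, 50, 100):  # Mbit/sec, as in the quoted measurement
    d = download_seconds(MODULE_MB, speed)
    print(
        f"{speed:>3} Mbit/s: download {d:.2f}s, "
        f"sequential {sequential_total(d, ASSUMED_COMPILE_S):.2f}s, "
        f"streaming {streaming_total(d, ASSUMED_COMPILE_S):.2f}s"
    )
```

Under this model, the slower the link, the more of the compile time hides behind the download, which matches the observation above.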

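In addition, some Array operations get much faster in certain cases.

All of these features and improvements ship in Chrome 65. According to "Chrome Platform Status", the stable release is scheduled for early March, and the beta should be out about now... (the page says 2/1, but it doesn't seem to have been updated yet)

Asynchronous Key Prefetch in Amazon Aurora (MySQL)

Amazon Aurora (MySQL) has rolled out a new performance improvement that speeds up JOINs: "Amazon Aurora (MySQL) Speeds Join Queries by More than 10x with Asynchronous Key Prefetch".

It looks like an optimization for one specific situation, where turning potentially random access into sequential access buys a big performance gain: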

This feature applies to queries that require use of the Batched Key Access (BKA) join algorithm and Multi-Range Read (MRR) optimization, and improves performance when the underlying data set is not in the main memory buffer pool or query cache.
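To make the random-versus-sequential point concrete, here is a small sketch of the idea behind Multi-Range Read: collect the row IDs found through the index, sort them, and only then fetch the rows, so that consecutive reads tend to land on the same page. The page size, the row IDs, and the "page switch" metric are all invented for illustration; the sketch does not model Aurora's asynchronous prefetching itself.

```python
ROWS_PER_PAGE = 100  # hypothetical number of rows stored per page

def page_switches(row_ids):
    """Count how often consecutive reads land on a different page."""
    switches = 0
    last_page = None
    for row_id in row_ids:
        page = row_id // ROWS_PER_PAGE
        if page != last_page:
            switches += 1
            last_page = page
    return switches

# Row IDs in the order an index lookup might return them (made-up values).
lookup_order = [905, 12, 450, 33, 910, 470, 7, 999]

print("unsorted:", page_switches(lookup_order))
print("sorted:  ", page_switches(sorted(lookup_order)))
```

With the rows fetched in sorted order, each page is visited once instead of being revisited several times, which is the kind of access pattern that prefetching can then exploit.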

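That said, memory is still the best accelerator. If you can get away with simply adding more of it, do that first... XD

Multivariate optimization at Amazon

"An efficient bandit algorithm for real-time multivariate optimization" describes how Amazon, instead of running traditional A/B tests, optimizes multiple variables at the same time: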

Consider the problem of trying to find a near-optimal version of a promotional message such as this one, which has 5 variable parts and 48 different combinations in total.

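With that many variants, the authors estimate it would take 66 days to get a valid result, which also means the problem only gets worse as the number of variables grows: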

Based on the Amazon success rate and traffic size, the authors calculated it would take 66 days to conduct a 48 treatment randomized experiment. Often this isn’t practical.

Which is what the opening of the post promises, namely how to raise the conversion rate by 21% in a single week:

Aka, “How Amazon improved conversion by 21% in a single week!”

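The author also points out that this approach flatly contradicts another paper he covered earlier, one from 2014 arguing that experiments should be kept as simple as possible XDDD: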

Yesterday we saw the hard-won wisdom on display in ‘seven myths‘ recommending that experiments be kept simple and only test one thing at a time, otherwise interpreting the results can get really complicated.

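All I can say is that things keep getting more complicated, so new methods are needed to solve the problem. E-commerce sites also run into dependencies between the factors under test (in other words, the factors are not independent), and the old one-factor-at-a-time approach will not detect those.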
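For readers unfamiliar with bandit algorithms, here is a minimal sketch of the general idea using plain Beta-Bernoulli Thompson sampling, with each of the 48 combinations treated as an independent arm. This is a deliberately simplified stand-in and not the algorithm from the paper, which additionally has to deal with the interactions between factors. The conversion rates and the number of rounds are invented for the simulation.

```python
import random

def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling over independent arms."""
    rng = random.Random(seed)
    n = len(true_rates)
    successes = [0] * n
    failures = [0] * n
    for _ in range(rounds):
        # Draw a plausible conversion rate for every arm from its posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n)
        ]
        # Show the arm whose sampled rate is highest.
        arm = max(range(n), key=samples.__getitem__)
        # Simulate whether this impression converted.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

# Hypothetical conversion rates: 47 ordinary combinations and one better one.
rates = [0.02] * 47 + [0.05]
wins, losses = thompson_sampling(rates, rounds=5000)
pulls = [w + l for w, l in zip(wins, losses)]
print("impressions per combination:", pulls)
```

Unlike a fixed 48-way randomized experiment, traffic is not split evenly: combinations that look bad receive fewer and fewer impressions as evidence accumulates, which is why this family of methods can converge faster.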

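Heimdall Data: automatically caching RDBMS data for better performance

I came across the AWS post "Automating SQL Caching for Amazon ElastiCache and Amazon RDS", which introduces a product called Heimdall Data – SQL caching and performance optimization.

The description on the official site also makes it clear that this is an extra proxy layer, but one that handles cache invalidation for you automatically: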

But what makes Heimdall Data unique in industry is its auto-cache AND auto-invalidation capability. Our machine learning algorithms determine what queries to cache while invalidating to ensure maximum performance and data integrity.

It appears to support the usual set of common RDBMSes:

Heimdall Data supports most all relational database (e.g. MySQL, Postgres, Amazon RDS, Oracle, SQL Server, MariaDB).

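It looks like a way to buy performance outright... I'm not sure how cache invalidation is supposed to work across machines, though; I didn't see anything in the FAQ about how a cluster setup is handled.

C++ versus assembly speed...

I saw "Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?" on Hacker News Daily and found it quite interesting...

The asker wrote a piece of assembly that ran much slower than the equivalent C++ version. The top-voted answer at the moment gives a very clear explanation...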

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

The code above is the asker's assembly version. It uses the div instruction, which is very slow:

On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles.

The C++ version, by contrast, ends up using shr (logical shift right: shift the bits to the right and fill the top bit with zero), which is far faster:

shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.

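This relies on the fact that shifting an unsigned integer right by one bit with shr is exactly a division by two. Because of the speed difference the trick is still used all the time, but people who don't normally write assembly can easily miss it, as happened above XDDD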
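The equivalence is easy to check. Below is a small sketch that verifies it for non-negative integers and then uses the shift in a Collatz step counter. The function name is made up for this example, and in Python the interpreter overhead dwarfs any difference between the two operations, so this shows only that the results agree, not the speedup.

```python
def collatz_steps(n: int) -> int:
    """Number of Collatz steps needed to reach 1 from n (n must be >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        if n & 1:          # odd: 3n + 1
            n = 3 * n + 1
        else:              # even: halve with a right shift instead of a division
            n >>= 1
        steps += 1
    return steps

# For non-negative integers, shifting right by one bit equals dividing by two.
assert all(n >> 1 == n // 2 for n in range(10_000))

print(collatz_steps(6))   # 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1, so 8 steps
```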

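WebKit introduces the B3 JIT Compiler (Bare Bones Backend)

WebKit has introduced B3 to make optimizing compilation faster, taking over the job LLVM used to do: "Introducing the B3 JIT Compiler".

The Performance Results section near the end of the article shows that the biggest difference is in startup time.

B3 also comes out slightly ahead of LLVM on almost all of the other performance benchmarks.

Next come plans for ARM64 and the other platforms: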

B3 is not yet complete. We still need to finish porting B3 to ARM64. B3 passes all tests, but we haven’t finished optimizing performance on ARM. Once all platforms that used the FTL switch to B3, we plan to remove LLVM support from the FTL JIT.

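Speeding up Node.js by reducing "comment length"...

I came across this performance tip in "#NodeJS : A quick optimization advice"... The post compares two pieces of code that differ only in their comments, and the performance differs by 50%.

The only cause is the length of the comments, because the inlining limit is measured on the function's source text, comments included. Tuning --max-inlined-source-size gets around it.

Painful, and there isn't much you can do about it: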

So when you have a function or callback that’ll be called repeatedly, try to make it under 600 characters (or your tweaked value), you’ll have a quick win !

About KeyCDN's HLS streaming optimization...

KeyCDN announced an optimization for HLS streaming: "New feature: Optimized HLS streaming".

This part looks very odd to me:

The index file (.m3u8) will not be cached at all. The .ts files will only be cached for 5 minutes. If the origin server sends other Cache Control headers, it will be ignored by the CDN.

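In other words, this screen:

If you send all of the load for .m3u8 requests straight to the backend, it is never going to scale. When I tested Wowza on an EC2 c3.8xlarge a while back, a single machine topped out at about 4000 reqs/sec.

A better approach is to cache the file for a very short time (maybe three to five seconds) and let the CDN absorb the load, instead of having the backend eat every req/sec that hits the front...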
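The short-TTL idea can be sketched as a tiny cache in front of the origin. Everything here is hypothetical: the class name, the playlist path, and the fake clock are for illustration only, and a real CDN would also have to coalesce concurrent misses, which this single-threaded sketch does not do.

```python
import time

class MicroCache:
    """Cache origin responses for a few seconds to absorb request bursts."""

    def __init__(self, fetch, ttl_seconds=3.0, clock=time.monotonic):
        self._fetch = fetch        # function that asks the origin for a key
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the demo can fake time
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # still fresh: the origin is not touched
        value = self._fetch(key)   # expired or missing: one origin request
        self._entries[key] = (now + self._ttl, value)
        return value

origin_hits = [0]

def fetch_playlist(path):
    """Stand-in for the origin server; counts how often it is called."""
    origin_hits[0] += 1
    return f"#EXTM3U playlist for {path}"

fake_now = [0.0]
cache = MicroCache(fetch_playlist, ttl_seconds=3.0, clock=lambda: fake_now[0])

# Simulate 4000 requests arriving within one second.
for i in range(4000):
    fake_now[0] = i / 4000
    cache.get("/live/stream.m3u8")

print("origin requests:", origin_hits[0])
```

With a 3-second TTL the origin sees at most one request per playlist every three seconds, no matter how many viewers hit the edge, at the cost of the playlist being up to three seconds stale.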

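Progress on HTTPS

In "We’re struggling to get traction with SSL because it’s still a “premium service”", Troy Hunt complains that the web is still a long way from HTTPS only, and even takes a jab at Let's Encrypt for its repeated delays (the "Togashi problem").

But Troy... your own site isn't on HTTPS either :/

While I'm at it, here is a summary of the benefits that HTTPS-related technology has brought so far: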

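Back in August 2014, Google said outright that HTTPS helps SEO: "HTTPS as a ranking signal".
"HTTPS + SPDY" or "HTTPS + HTTP/2" is faster than "HTTP/1.1 without SSL/TLS"; see "A Simple Performance Comparison of HTTPS, SPDY and HTTP/2" for details. The reason is that neither HTTP/1.1 nor HTTP/1.1 + SSL/TLS supports multiplexing, and multiplexing is what makes SPDY and HTTP/2 so much faster.
Following on from the previous item, see the test at "HTTP vs HTTPS Test". The HTTPS server there uses SPDY, and the difference is very noticeable when a page has lots of small images.

The current best practice for websites is HTTPS + HTTP/2: it is good for SEO and fast (both of which affect revenue), and it also improves security (which helps reputation).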

Adding one line of code makes PHP Composer dramatically faster

You can go straight to the commit log on GitHub: "Disable GC when computing deps, refs #3482".

      */
     public function run()
     {
+        gc_disable();
+
         if ($this->dryRun) {
             $this->verbose = true;
             $this->runScripts = false;

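The comments underneath have turned into a festival XDDD

Then, according to the tests in "Avoid generating duplicate conflict rules by naderman · Pull Request #3482 · composer/composer" (you have to read the discussion further down), setting zend.enable_gc=0 saves even more, with speedups measured in multiples...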
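The same pattern exists outside PHP. As an analogy only (this is Python, not Composer's code), here is a sketch that switches off Python's cyclic garbage collector while building a large structure and restores it afterwards. The function name and the node layout are invented; whether it actually helps depends on the workload, so measure before adopting it.

```python
import gc

def build_without_gc(count):
    """Build many small objects with the cyclic collector paused."""
    was_enabled = gc.isenabled()
    gc.disable()                 # skip collector passes during the allocation burst
    try:
        return [{"id": i, "deps": []} for i in range(count)]
    finally:
        if was_enabled:
            gc.enable()          # put the collector back the way we found it

nodes = build_without_gc(100_000)
print(len(nodes), "nodes built; gc enabled again:", gc.isenabled())
```

The trade-off is the same as in the Composer commit: reference cycles created while the collector is off are not reclaimed until it runs again, so memory use can grow in exchange for the speedup.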

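Composer's performance has been a sore point for a long time, so now that someone has blown the issue open, a bunch of people will put resources into finding more optimizations... XD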