Instagram 改善影片上架速度的方式

不是什麼魔法,其實是改產品面上的規格 (但是發表到 Instagram Engineering 上):「Video Upload Latency Improvements at Instagram」。

最原始的版本是所有的格式都轉完後才可以上架:

然後把規格改成最高畫質的版本轉完後就可以先上架:

The idea is, instead of blocking until all video versions are available, we can publish the video once the highest-quality video version is available.

然後是把影片切段上傳,所以傳一半就可以先處理一半,變成 pipeline 的概念,但增加程式的複雜度,以及被迫要調整影片品質的參數:

Segmented uploads reduce upload latency in many cases but come with a few tradeoffs. For instance, segmented uploads increase the complexity of the pipeline. There are some quality metrics that are only available per segment at transcode time, such as SSIM. These metrics are not helpful to us on a per segment basis. Therefore, we need to do a duration weighted average of the SSIM of all segments to come up with the SSIM of the whole video. Similarly, handling exceptions is more complex since there are more cases to handle.

另外有一種特例是上傳的影片本身就已經符合伺服器的規格,這樣的話可以直接放行 (不過這樣不會有 security concern 嗎...):

Another performance optimization we use to improve the upload latency and save CPU utilization is something we call a “passthrough” upload. In some cases, the media that is uploaded is already ready for playback on most devices.

都是想的出來而且會帶有 tradeoff 的方法,而不是完全正面的改善 :o

Twitter 對 2x 與 3x 的圖片的研究...

所以發現很多時候用 2x 的圖片就夠了?:「Capping image fidelity on ultra-high resolution devices」。

會這樣討論主要是發現螢幕特性:

The most modern screens are OLED. These screens boast some really great features like pure blacks, and are marketed as 3x scale. However, nearly no "3x scale" OLED actually has perfect 3x3 pixels per dot on their screen.

因為螢幕不是真的到 3x 的要求,丟 2x 的圖片出去就好,省頻寬又省下載時間:

This means that most OLED screens that say they are 3x resolution, are actually 3x in the green color, but only 1.5x in the red and blue colors. Showing a 3x resolution image in the app vs a 2x resolution image will be visually the same, though the 3x image takes significantly more data. Even true 3x resolution screens are wasteful as the human eye cannot see that level of detail without something like a magnifying glass.

省下 38% 的資料量,32% 的時間:

There's no difference that the human eye can see, but will save 38% on data and 32% on latency on the capped image load for this particular example which is reflective of most images that load on Twitter.

這也另外帶出了其他的想法,如果沒有太多時間研究的話,可以考慮先提供 2x 的就好,不需要特地做 3x 的版本...

Cloudflare 改善了同時下載同一個檔案的效率...

在「Live video just got more live: Introducing Concurrent Streaming Acceleration」這邊 Cloudflare 說明他們改善了同時下載同一個 cache-miss 檔案時的效率。

本來的架構在 cache-miss 時,CDN 這端會先取得 exclusive lock,然後到 origin server 抓檔案。如果這時候有其他人也要抓同一個檔案,就會先卡住,避免同時間對 origin server 產生大量連線:

這個架構在一般的情況下其實還 ok,就算是 Windows Update 這種等級的量,畢竟就是一次性的情況而已。但對於現代愈來愈普及的 HTTP(S) streaming 技術來說,因為檔案一直產生,這就會是常常遇到的問題了。

由於 lock 機制增加了不少延遲,所以在使用者端就需要多抓一些緩衝時間才能確保品質,這增加了直播的互動延遲,所以 Cloudflare 改善了這個部分,讓所有人都可以同時下載,而非等到發起的使用者下載完才能下載:

沒有太多意外的,從 Cloudflare 內部數字可以看出來這讓 lock 時間大幅下降,對於使用者來說也大幅降低了 TTFB (time to first byte):

不確定其他家的情況...

單機 10 萬個連線 MySQL

也是在「Links: February 2019」這邊看到的,裡面提到了 Percona 的「MySQL Challenge: 100k Connections」。

Percona 的測試是希望每個連線都有在做事,而不是 idle connection,這個測試有點像是卡住時的情況?看起來只有這幾個參數比較特別:

table_open_cache = 200000
back_log=3500
max_connections=110000
max_prepared_stmt_count=1000000

max_connections 開多一點算是廢話,然後因為要做事所以 max_prepared_stmt_count 也多一些,back_log 可以讓 kernel 保留來不及處理的 TCP 連線。

看起來用 sysbench 測試還撐的住,跟理論差不多,隨著連線數的增加 latency 也會增加...

Amazon EFS 的 IA Storage Class

Amazon EFS 一直找不太到好用的使用情境,因為 NFS 的關係所以大量 I/O 時的 latency issue 使得速度快不起來,而拿來堆 log 的成本又超級高...

最新推出的 storage class 則是透過提供低儲存成本的版本,解決了堆 log 這種使用情境:「New – Infrequent Access Storage Class for Amazon Elastic File System (EFS)」。

不過 EFS 不像 S3 可以直接選擇 storage class,是需要讓系統管理的:

開啟後 30 天沒有被碰過的檔案就會切過去:

Eligible Files – Files that are 128 KiB or larger and that have not been accessed or modified for at least 30 days can be transitioned to the new storage class. Modifications to a file’s metadata that do not change the file will not delay a transition.

而 latency 也會增加:

Files that have not been read or written for 30 days will be transitioned to the Infrequent Access storage class with no further action on your part. Files in the Standard Access class can be accessed with latency measured in single-digit milliseconds; files in the Infrequent Access class have latency in the double-digits.

us-east-1 為例子來說,Standard 是 USD$0.3/GB-month,而 IA 只要 USD$0.045/GB-month,但抓取時會有 USD$0.01/GB 的傳輸費用,可以看出價錢低不少。

不過文章裡沒提到什麼時候會把資料從 IA 跳回 Standard,可能得找機會問問看...

HiNet 與 DigitalOcean、Linode、Vultr 的封包情況

先說結論,綜合網路與 CPU 的情況,我剛好跟下面提到的文章給出相反的選擇 (i.e. 完全不會選 DigitalOcean)。如果是需要 latency 低的品質我會選 Linode 的東京新機房 Tokyo 2,如果不需要 latency 的我會選 Vultr 的 USD$2.5/month 方案 (目前只在邁阿密與紐約有)。

看到「2018/06 台灣 5USD 虛擬主機網路延遲測試」這篇就來推廣一下 SmokePing 這個工具。這個工具可以做很多事情,但最常看到的用途還是做網路品質監控,先前在 K 社的時候就有個做個公開的站台可以看,後來接手的人也繼續維護著 (畢竟看這些圖有種治癒感?):「smokeping.kkbox.com.tw」。

不過 K 社的 SmokePing 裡面大多數是從固網機房端監控,而固網機房端的 Internet 品質一般來說都會比家用型的好很多,尤其是國際頻寬的部份。所以我也在我家裡用 PPPoE 版本的固定 IP 做了一份:「https://home.gslin.org/smokeping/」,這邊的設定檔放在 GitHub 上的 gslin/smokeping-config.d 上。

而我剛好有把這三家 VPS 的 SmokePing 都做起來:「SmokePing Latency Page for DigitalOcean」、「SmokePing Latency Page for Linode」、「SmokePing Latency Page for Vultr」。

我這邊看到的情況是這樣。以各家離台灣最近的點來看:

  • 第一張圖的 DigitalOcean 沒有東京的點,而新加坡的 latency 在這幾個月其實變差不少,現在大約要 90ms (扣掉光世代的 10ms)。
  • 第二跟第三張圖的 Linode (分別是 Tokyo 1 與 Tokyo 2) 其實可以看到新機房 Tokyo 2 的 latency 比舊機房 Tokyo 1 還好。
  • 第四張圖的 Vultr 則是狀況變化很多,但不管怎麼走,latency 大致上都還是比新加坡好。

另外第五張的 Vultr 則是紐約的點,latency 超高 (畢竟繞了半個地球),但 packet loss 不高,品質還算穩定。


speedtest-sgp1.digitalocean.com (DigitalOcean Singapore 1)


speedtest.tokyo.linode.com (Linode Tokyo)


speedtest.tokyo2.linode.com (Linode Tokyo 2)


hnd-jp-ping.vultr.com


nj-us-ping.vultr.com

另外是之前有痛到的部份,先前因為需求而需要在 PHP 5.6 上跑 WordPress,真的實際跑起來後發現超慢 (畢竟這兩個要快得想不少辦法),去找問題後發現 DigitalOcean 機器的 CPU 真的太慢,後來把這組需求搬去 Linode (在 CPU 與網路之間取個合理的平衡點)。

在各家 VPS 上用 Ubuntu 16.04 跑 openssl speed md5 可以看出一些資料:

DigitalOcean:

Doing md5 for 3s on 16 size blocks: 5465798 md5's in 3.00s
Doing md5 for 3s on 64 size blocks: 3761125 md5's in 3.00s
Doing md5 for 3s on 256 size blocks: 1835218 md5's in 2.99s
Doing md5 for 3s on 1024 size blocks: 582162 md5's in 2.96s
Doing md5 for 3s on 8192 size blocks: 102995 md5's in 2.97s
Doing md5 for 3s on 16384 size blocks: 47177 md5's in 2.99s

Linode:

Doing md5 for 3s on 16 size blocks: 11510700 md5's in 3.00s
Doing md5 for 3s on 64 size blocks: 8361353 md5's in 2.99s
Doing md5 for 3s on 256 size blocks: 3751929 md5's in 3.00s
Doing md5 for 3s on 1024 size blocks: 1169457 md5's in 3.00s
Doing md5 for 3s on 8192 size blocks: 157678 md5's in 2.99s
Doing md5 for 3s on 16384 size blocks: 78874 md5's in 3.00s

Vultr (這是 USD$2.5/month 的方案):

Doing md5 for 3s on 16 size blocks: 14929209 md5's in 2.97s
Doing md5 for 3s on 64 size blocks: 9479563 md5's in 2.97s
Doing md5 for 3s on 256 size blocks: 4237907 md5's in 2.98s
Doing md5 for 3s on 1024 size blocks: 1320548 md5's in 2.98s
Doing md5 for 3s on 8192 size blocks: 161940 md5's in 2.96s
Doing md5 for 3s on 16384 size blocks: 86592 md5's in 2.98s

然後補一個 AWS 的 t2.nano (在還有 CPU credit 可以全速跑的情況下),不過這不公平,參考用而已:

Doing md5 for 3s on 16 size blocks: 19257426 md5's in 3.00s
Doing md5 for 3s on 64 size blocks: 11168752 md5's in 2.99s
Doing md5 for 3s on 256 size blocks: 4959879 md5's in 3.00s
Doing md5 for 3s on 1024 size blocks: 1518690 md5's in 3.00s
Doing md5 for 3s on 8192 size blocks: 203910 md5's in 3.00s
Doing md5 for 3s on 16384 size blocks: 102321 md5's in 2.99s

Cloudflare 推出 1.1.1.1 的 DNS Resolver 服務

Cloudflare 推出了 1.1.1.1 上的 DNS Resolver 服務:「Announcing 1.1.1.1: the fastest, privacy-first consumer DNS service」,主打項目是隱私以及效能。

然後因為這個 IP 的特殊性,上面有不少奇怪的流量... 而 Cloudflare 跟 APNIC 交換條件後取得這個 IP address 的使用權 (然後 anycast 發出去):

APNIC's research group held the IP addresses 1.1.1.1 and 1.0.0.1. While the addresses were valid, so many people had entered them into various random systems that they were continuously overwhelmed by a flood of garbage traffic. APNIC wanted to study this garbage traffic but any time they'd tried to announce the IPs, the flood would overwhelm any conventional network.

We talked to the APNIC team about how we wanted to create a privacy-first, extremely fast DNS system. They thought it was a laudable goal. We offered Cloudflare's network to receive and study the garbage traffic in exchange for being able to offer a DNS resolver on the memorable IPs. And, with that, 1.1.1.1 was born.

Cloudflare 做了效能比較表 (與 Google Public DNSOpenDNS 比較),可以看到平均速度快不少:

在台灣的話,HiNet 非固定制 (也就是 PPPoE 連線的使用者) 連到 8.8.8.8 有奇怪的 latency:

可以比較同一台機器對 168.95.1.1 的反應速度:

不過如果你是 HiNet 固定制 (固 2 或是固 6 IP 那種,不透過 PPPoE,直接設定 IP address 使用 bridge mode 連線的使用者),兩者的 latency 就差不多,不知道是 Google 還是 HiNet 的架構造成的。

另外比較奇怪的一點是在文章最後面提到的 https://1.1.1.1/,理論上不會發 IP-based 的 SSL certificate 才對?不知道 CEO 老大是有什麼誤解... XD

Visit https://1.1.1.1/ from any device to get started with the Internet's fastest, privacy-first DNS service.

Update:查了資料發現是可以發的,只是大多數的 CA 沒有提供而已...

Instagram 解決 Cassandra 效能問題的方法

在解決 Cassandra 效能問題中大概就 ScyllaDB 特別有名,用 C++ 重寫一次使得效能大幅改善。而 Instagram 的人則是把底層的資料結構換掉,改用 RocksDB (這公司真的很愛自家的 RocksDB...):「Open-sourcing a 10x reduction in Apache Cassandra tail latency」。

主要原因是他們發現 Cassandra 在處理資料的部份會有 JVM 的 GC 問題,而且是導致 Cassandra 效能差的主要原因:

Apache Cassandra is a distributed database with it’s own LSM tree-based storage engine written in Java. We found that the components in the storage engine, like memtable, compaction, read/write path, etc., created a lot of objects in the Java heap and generated a lot of overhead to JVM.

然後在換完後測試可以看到效能大幅提昇,也可以看到 GC 的延遲大幅降低:

In one of our production clusters, the P99 read latency dropped from 60ms to 20ms. We also observed that the GC stalls on that cluster dropped from 2.5% to 0.3%, which was a 10X reduction!

比較一下這兩者的差異:在 ScyllaDB 是全部都用 C++ 改寫 (資料結構不換),這樣就直接解決掉 JVM 的 GC 問題。在 Rocksandra 則是在 profiling 後挑重點換掉 (這邊看起來是處理資料的 code,直接換成 RocksDB),另外順便把一些界面抽象化... 兩個不一樣的解法,都解決了 JVM 的 GC 問題。

AWS Media Services 推出一卡車與影音相關的服務...

AWS 推出了一連串 AWS Elemental MediaOOXX 一連串影音相關的服務:「AWS Media Services – Process, Store, and Monetize Cloud-Based Video」。

但不是所有的服務都是相同的區域... 公告分別在:

不過這邊還是引用 Jeff Barr 文章裡的說明,可以看到從很源頭的 transencoding 到 DRM,以及 Live 格式,到後續的檔案儲存及後製 (像是上廣告) 都有:

AWS Elemental MediaConvert – File-based transcoding for OTT, broadcast, or archiving, with support for a long list of formats and codecs. Features include multi-channel audio, graphic overlays, closed captioning, and several DRM options.

AWS Elemental MediaLive – Live encoding to deliver video streams in real time to both televisions and multiscreen devices. Allows you to deploy highly reliable live channels in minutes, with full control over encoding parameters. It supports ad insertion, multi-channel audio, graphic overlays, and closed captioning.

AWS Elemental MediaPackage – Video origination and just-in-time packaging. Starting from a single input, produces output for multiple devices representing a long list of current and legacy formats. Supports multiple monetization models, time-shifted live streaming, ad insertion, DRM, and blackout management.

AWS Elemental MediaStore – Media-optimized storage that enables high performance and low latency applications such as live streaming, while taking advantage of the scale and durability of Amazon Simple Storage Service (S3).

AWS Elemental MediaTailor – Monetization service that supports ad serving and server-side ad insertion, a broad range of devices, transcoding, and accurate reporting of server-side and client-side ad insertion.

引個前同事的 tweet,先不說 Amazon SWF 的情況 (畢竟 Amazon SWF 還可以找到其他用途),倒是 Amazon Elastic Transcoder 很明顯要被淘汰掉了:

這種整個大包的東西是 AWS re:Invent 才有的能量,平常比較少看到...

Cloudflare 的 F-Root

Cloudflare 從三月底開始跟 ISC 簽約合作,服務 F-Root 這個 DNS Service (f.root-servers.net):「Delivering Dot」。

Since March 30, 2017, Cloudflare has been providing DNS Anycast service as additional F-Root instances under contract with ISC (the F-Root operator).

Linode 東京的機器上面可以看出來 www.cloudflare.com 走的路徑跟 f.root-server.net 相同:

gslin@one [~] [22:49] mtr -4 --report www.cloudflare.com
Start: Tue Sep 12 22:49:29 2017
HOST: one.abpe.org                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 139.162.65.2               0.0%    10    0.6   0.6   0.5   0.6   0.0
  2.|-- 139.162.64.5               0.0%    10    2.0   1.1   0.6   2.5   0.5
  3.|-- 139.162.64.8               0.0%    10    0.7   1.0   0.7   2.1   0.3
  4.|-- 218.100.6.62               0.0%    10    0.8   0.8   0.8   1.0   0.0
  5.|-- 198.41.215.162             0.0%    10    0.7   0.7   0.7   0.8   0.0
gslin@one [~] [22:49] mtr -4 --report f.root-servers.net
Start: Tue Sep 12 22:49:46 2017
HOST: one.abpe.org                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 139.162.65.3               0.0%    10    0.5   0.6   0.5   0.6   0.0
  2.|-- 139.162.64.7               0.0%    10    0.7   0.7   0.6   0.8   0.0
  3.|-- 139.162.64.8               0.0%    10    0.7   0.7   0.6   0.8   0.0
  4.|-- 218.100.6.62               0.0%    10    0.8   0.8   0.8   0.8   0.0
  5.|-- f.root-servers.net         0.0%    10    0.8   0.8   0.7   0.8   0.0

而且也可以從監控發現,f.root-servers.net 的效能變好:

Using RIPE atlas probe measurements, we can see an immediate performance benefit to the F-Root server, from 8.24 median RTT to 4.24 median RTT.

DNS query 的量也大幅增加:

而且之後也會隨著 Cloudflare 的 PoP 增加而愈來愈快... 在原文的 comment 也提到了 Cloudflare 也有打算跟其他的 Root Server 合作,所以看起來會讓整個 infrastructure 愈來愈快而且穩定。

另外這也代表台灣在本島也會直接連到 F-Root 了,不過 HiNet 自己也有 F-Root,所以 HiNet 的部份就沒什麼差...