Stripe 遇到 AWS 上 DNS Resolver 的限制

當量夠大就會遇到各種限制...

這次 Stripe 在描述 trouble shooting 的過程:「The secret life of DNS packets: investigating complex networks」。

其中一個頗有趣的架構是他們在每台主機上都有跑 Unbound,然後導去中央的 DNS Resolver,再決定導去 Consul 或是 AWS 的 DNS Resolver:

Unbound runs locally on every host as well as on the DNS servers.

然後他們發現偶而會有大量的 SERVFAIL

接下來就是各種找問題的過程 (像是用 tcpdump 看情況,然後用 iptables 統計一些數字),最後發現是卡在 AWS 的 DNS Resolver 在 60 秒內只回應了 61,385 packets,換算差不多是 1,023 packets/sec,這數字看起來就很雷:

During one of the 60-second collection periods the DNS server sent 257,430 packets to the VPC resolver. The VPC resolver replied back with only 61,385 packets, which averages to 1,023 packets per second. We realized we may be hitting the AWS limit for how much traffic can be sent to a VPC resolver, which is 1,024 packets per second per interface. Our next step was to establish better visibility in our cluster to validate our hypothesis.

在官方文件「Using DNS with Your VPC」這邊看到對應的說明:

Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This limit cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.

iptables 看到的量則是:

找到問題後,後面就是要找方法解決了... 他們給了一個只能算是不會有什麼副作用的 workaround,不過也的確想不到太好的解法。

因為是查詢 10.0.0.0/8 網段反解產生大量的查詢,所以就在各 server 上的 Unbound 上指定這個網段直接問 AWS 的 DNS Resolver,不需要往中央的 DNS Resolver 問,這樣在這個場景就不會遇到 1024 packets/sec 問題了 XDDD

修改 Firefox 的 Pop Up 限制

這是在 Bazqux 使用時遇到的問題,我在 Bazqux 這邊看到想看的文章,會習慣用 c 鍵打開新的 tab,等下再一起看。在 Chrome 上面需要安裝 extension 才能開到背景,在 Firefox 上可以透過修改 about:config 裡的browser.tabs.loadDivertedInBackground,讓他預設開到背景 (雖然變成所有的行為都會到開到背景 tab,但我偏好這樣...)。

但在 Bazqux 用 c 鍵把連結打開到 tab 時,有時發現會跳出遇到開啟上限:

這其實很不方便... 所以故意連打測試發現是 20 個,找了一下資料發現是 dom.popup_maximum 這個值,改成 -1 就好了。這算是某種安全機制吧...

先繼續用看看 :o

Stripe 將 Redis 單機版轉到 Cluster 版本上降低了錯誤率

在「Scaling a High-traffic Rate Limiting Stack With Redis Cluster」這邊提到了 StripeRedis 單機版轉移到 10 個節點的 cluster 版本,然後錯誤率大幅下降:

Stripe’s rate limiters are built on top of Redis, and until recently, they ran on a single very hot instance of Redis. The server had followers in place for failover, but at any given time, one node was handling every operation.

We eventually solved it by migrating to a 10-node Redis Cluster.

另外也可以看出來,在轉移到 cluster 版本後有不少要注意的,像是因為 sharding 而需要調整平衡性。另外是 cluster 模式下寫入的 confirmation 跟一般預期的不太一樣,不過這對於 rate limit 的應用還好,可以接受某種程度的掉資料...

Amazon RDS 支援更大的硬碟空間與更多的 IOPS

Amazon RDS 的升級:「Amazon RDS Now Supports Database Storage Size up to 16TB and Faster Scaling for MySQL, MariaDB, Oracle, and PostgreSQL Engines」。

空間上限從 6TB 變成 16TB,而且可以無痛升。另外 IOPS 上限從 30K 變成 40K:

Starting today, you can create Amazon RDS database instances for MySQL, MariaDB, Oracle, and PostgreSQL database engines with up to 16TB of storage. Existing database instances can also be scaled up to 16TB storage without any downtime.

The new storage limit is an increase from 6TB and is supported for Provisioned IOPS and General Purpose SSD storage types. You can also provision up to 40,000 IOPS for Provisioned IOPS storage volumes, an increase from 30,000 IOPS.

不過隔壁的 Amazon Aurora 還是大很多啊 (64TB),而且實際上不用管劃多大,他會自己長大:

Q: What are the minimum and maximum storage limits of an Amazon Aurora database?

The minimum storage is 10GB. Based on your database usage, your Amazon Aurora storage will automatically grow, up to 64 TB, in 10GB increments with no impact to database performance. There is no need to provision storage in advance.

用 4.5+ 的 Linux Kernel 限制 I/O 速度

在「Using cgroups to limit I/O」這邊看到作者試著用 cgroups 限制 I/O 速度。

作者前面花了不少篇幅解釋 cgroups v1 無法正確限制 I/O 速度,後面就在講 cgroups v2 怎麼做:

So, in order to limit I/O when this I/O may hit the writeback kernel cache, we need to use both memory and io controllers in the cgroups v2!

這會需要 4.5+ 的 kernel,可能會需要手動更新,或是直接使用比較新的 distribution:

Since kernel 4.5, the cgroups v2 implementation was marked non-experimental.

然後照抄就可以了 (不過這邊的指定都需要 root,作者用 $ 表示 shell 有點怪):

# mount -t cgroup2 nodev /cgroup2
# mkdir /cgroup2/cg2
# echo "+io" > /cgroup2/cgroup.subtree_control
# echo "8:0 wbps=1048576" > io.max
# echo $$ > /cgroup2/cg2/cgroup.procs

然後就可以跑 dd 測試速度了,同時間也可以跑 iostat 看。

Twitter 打算放寬到 280 字...

Twitter 打算放寬 140 字限制:「Giving you more characters to express yourself」。

不過不包括日文、中文與韓文 XD

We want every person around the world to easily express themselves on Twitter, so we're doing something new: we're going to try out a longer limit, 280 characters, in languages impacted by cramming (which is all except Japanese, Chinese, and Korean).

然後也拿日文與英文當範例:

然後做了比較:

GitHub Apps (前身 GitHub Integrations) 的 Rate Limiting 變得更彈性

GitHub 宣佈了把 GitHub Integrations 改名為 GitHub Apps,另外 Rate Limiting 變得更彈性:「GitHub Apps (formerly Integrations) General Release」。

All GitHub Apps start with a rate limit of 5000 requests per hour. To facilitate growth we have added a dynamic rate limit for installations: organization installations with more than 20 users receive another 50 requests per hour for each user. Installations that have more than 20 repositories receive another 50 requests per hour for each repository.

所以對於超過 20 個使用者以及超過 20 個 repository 的 organization 都會增加 Rate Limiting。每個單位增加 50 requests/hour 不算太多,不過想一下好像也不少... (尤其對大一點的團體來說)

Stripe 對於控制 API 使用量的作法

在「Scaling your API with rate limiters」這篇 StripePaul Tarjan 提到了四種如何保護 API 的作法。

前兩種都是 rate limit。第一種是最標準的「你一分鐘可以用幾次」的方式,這是最容易理解的方式。第二種是「你同時間可以用幾個 API request」,這通常會用在大量消耗資源的 API 上,避免短時間內被打爆。

第三種是拉到整體來看,把 API 分成重要與不重要的,然後直接保留確保重要的 API 有一定的 capacity 可以用:

We always reserve a fraction of our infrastructure for critical requests. If our reservation number is 20%, then any non-critical request over their 80% allocation would be rejected with status code 503.

第四種方式是當過載時的自動化處理,平常就把各種工作排優先順序,當超量的時候自動先將低優先權的拿掉:

Only 100 requests were rejected this month from this rate limiter, but in the past it’s done a lot to help us recover more quickly when we have had load problems. This load shedder limits the impact of incidents that are already happening and provides damage control, while the first three are more preventative.

不過還是有點怪,Stripe 應該是全部都建在 AWS 上面 (AWS Case Study: Stripe),跟 auto scaling 的配合好像都沒提到?

歸類的方式還蠻有條理的,可以學這個方法來規劃...

Google Cloud Platform 的終身免費方案

Google 在這次 Google Cloud Next '17 公佈了 Google Cloud Platform Free TierAlways Free Usage Limits,也就是終身免費的方案。

這次宣佈的包括了許多服務。掃了一遍應該會去玩 GCE 的部份,包括了 0.6GB RAM 的機器以及 1GB 的對外流量 (不過到中國與澳洲要另外計費,不包含在這個範圍內)。

領先者的 AWS 不知道會不會也跟進...