Stripe 遇到 AWS 上 DNS Resolver 的限制

當量夠大就會遇到各種限制...

這次 Stripe 在描述 trouble shooting 的過程:「The secret life of DNS packets: investigating complex networks」。

其中一個頗有趣的架構是他們在每台主機上都有跑 Unbound,然後導去中央的 DNS Resolver,再決定導去 Consul 或是 AWS 的 DNS Resolver:

Unbound runs locally on every host as well as on the DNS servers.

然後他們發現偶而會有大量的 SERVFAIL

接下來就是各種找問題的過程 (像是用 tcpdump 看情況,然後用 iptables 統計一些數字),最後發現是卡在 AWS 的 DNS Resolver 在 60 秒內只回應了 61,385 packets,換算差不多是 1,023 packets/sec,這數字看起來就很雷:

During one of the 60-second collection periods the DNS server sent 257,430 packets to the VPC resolver. The VPC resolver replied back with only 61,385 packets, which averages to 1,023 packets per second. We realized we may be hitting the AWS limit for how much traffic can be sent to a VPC resolver, which is 1,024 packets per second per interface. Our next step was to establish better visibility in our cluster to validate our hypothesis.

在官方文件「Using DNS with Your VPC」這邊看到對應的說明:

Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This limit cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.

iptables 看到的量則是:

找到問題後,後面就是要找方法解決了... 他們給了一個只能算是不會有什麼副作用的 workaround,不過也的確想不到太好的解法。

因為是查詢 10.0.0.0/8 網段反解產生大量的查詢,所以就在各 server 上的 Unbound 上指定這個網段直接問 AWS 的 DNS Resolver,不需要往中央的 DNS Resolver 問,這樣在這個場景就不會遇到 1024 packets/sec 問題了 XDDD