Stripe’s rate limiters are built on top of Redis, and until recently, they ran on a single very hot instance of Redis. The server had followers in place for failover, but at any given time, one node was handling every operation.
We eventually solved it by migrating to a 10-node Redis Cluster.
Consider the problem of trying to find a near-optimal version of a promotional message such as this one, which has 5 variable parts and 48 different combinations in total.
Based on the Amazon success rate and traffic size, the authors calculated it would take 66 days to conduct a 48 treatment randomized experiment. Often this isn’t practical.
也就是開頭提到的,如何一個禮拜就提昇 21% conversion rate:
Aka, “How Amazon improved conversion by 21% in a single week!”
Yesterday we saw the hard-won wisdom on display in ‘seven myths‘ recommending that experiments be kept simple and only test one thing at a time, otherwise interpreting the results can get really complicated.
Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter
但實際的運算其實沒那麼複雜,像是 Ruby 的程式碼可以看出大多都是系統內的運算就可以算出來。其中的 z 在大多數的情況下是常數。
require 'statistics2'
def ci_lower_bound(pos, n, confidence)
if n == 0
return 0
end
z = Statistics2.pnormaldist(1-(1-confidence)/2)
phat = 1.0*pos/n
(phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end
The z-score in this function never changes, so if you don't have a statistics package handy or if performance is an issue you can always hard-code a value here for z. (Use 1.96 for a confidence level of 0.95.)
The amount of memory varies per implementation, but in the case of this implementation, we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!
維基百科上有說明當資料量在 109 這個等級時,用 1.5KB 的記憶體只有 2% 的誤差值:
The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical error rate of 2%, using 1.5 kB of memory.
All GitHub Apps start with a rate limit of 5000 requests per hour. To facilitate growth we have added a dynamic rate limit for installations: organization installations with more than 20 users receive another 50 requests per hour for each user. Installations that have more than 20 repositories receive another 50 requests per hour for each repository.
前兩種都是 rate limit。第一種是最標準的「你一分鐘可以用幾次」的方式,這是最容易理解的方式。第二種是「你同時間可以用幾個 API request」,這通常會用在大量消耗資源的 API 上,避免短時間內被打爆。
第三種是拉到整體來看,把 API 分成重要與不重要的,然後直接保留確保重要的 API 有一定的 capacity 可以用:
We always reserve a fraction of our infrastructure for critical requests. If our reservation number is 20%, then any non-critical request over their 80% allocation would be rejected with status code 503.
第四種方式是當過載時的自動化處理,平常就把各種工作排優先順序,當超量的時候自動先將低優先權的拿掉:
Only 100 requests were rejected this month from this rate limiter, but in the past it’s done a lot to help us recover more quickly when we have had load problems. This load shedder limits the impact of incidents that are already happening and provides damage control, while the first three are more preventative.