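AdRoll wrote up the pitfalls they hit with DynamoDB Autoscaling, and some of the details are things you probably would not notice unless you jumped in and used it yourself (the devil is in the details): "Managing DynamoDB Autoscaling with Lambda and Cloudwatch".
The first issue they raise is which metric autoscaling watches: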
Ideally, the table should scale based on the number of requests that we are making, not the number of requests that are successful.
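To make the distinction concrete, here is a minimal sketch of sizing a table from attempted rather than successful traffic. It assumes you have already pulled a period's sums for CloudWatch's `ConsumedWriteCapacityUnits` and `WriteThrottleEvents`, and it treats each throttled request as roughly one capacity unit, which is only an approximation. The function names and the 70% target utilization are made up for illustration and are not from the AdRoll article.

```python
import math


def requested_units_per_second(consumed_units, throttled_events, period_seconds):
    """Estimate attempted throughput: what succeeded plus what was throttled.

    Assumption: each throttled request would have cost about one capacity unit.
    """
    return (consumed_units + throttled_events) / period_seconds


def target_capacity(consumed_units, throttled_events, period_seconds,
                    target_utilization=0.7):
    """Provisioned units needed so attempted traffic sits at the target utilization."""
    demand = requested_units_per_second(consumed_units, throttled_events, period_seconds)
    # DynamoDB provisioned capacity cannot go below 1 unit.
    return max(1, math.ceil(demand / target_utilization))
```

If only the consumed units were used, a table that is already throttling would look like it is running right at its limit rather than over it, and the scaling step would come out too small.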
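The other is that autoscaling does not scale down when a table is not being used at all, which looks like some kind of safeguard. But it means that for a table that is normally only read from, you have to handle scaling the write capacity back down yourself after a batch job finishes: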
Additionally, at the time of implementing this algorithm, the DynamoDB capacity could not be brought down automatically if the consumption was exactly zero, which can happen if you write to your table in batch instead of realtime, for example.
This meant that, when enabling autoscaling, tables that were read in realtime, but written to in batch, still needed manual intervention to bring the write capacity down after our jobs were done writing.
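The manual intervention described above could be scripted along these lines. This is a sketch under assumptions, not what AdRoll actually runs: `client` is expected to be a `boto3.client("dynamodb")` object, and `scale_down_writes` and `idle_write_units` are hypothetical names. `describe_table` and `update_table` are the real DynamoDB API calls, and `update_table` needs both read and write units in `ProvisionedThroughput`, so the current read capacity is passed through unchanged.

```python
def scale_down_writes(client, table_name, idle_write_units=1):
    """Lower a table's write capacity after a batch job, leaving reads untouched.

    `client` is assumed to be boto3.client("dynamodb"); `idle_write_units` is a
    hypothetical floor chosen for illustration.
    Returns True if an update was issued, False if the table was already low enough.
    """
    current = client.describe_table(TableName=table_name)["Table"]["ProvisionedThroughput"]
    if current["WriteCapacityUnits"] <= idle_write_units:
        # Nothing to do; avoid spending one of the limited daily decreases.
        return False
    client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": current["ReadCapacityUnits"],
            "WriteCapacityUnits": idle_write_units,
        },
    )
    return True
```

Calling it once at the end of the batch job (or from a scheduled Lambda) is enough. The early return matters because each call that actually lowers capacity counts against the daily decrease limit.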
Another problem is that the number of scale-downs is limited:
Another interesting point that might bite users is that capacity decreases are an expensive operation for AWS, so they’re limited.
The number of decreases cited in the documentation can be achieved under very special conditions, since you need to have 4 decreases in the first hour of the day plus one for each of the remaining hours, for a total of 4 (first hour) + 23 (1 hourly) = 27.
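The arithmetic in that quote can be written out as a tiny helper. The defaults simply mirror the numbers quoted above (4 decreases in the first hour, 1 per remaining hour); the function name is made up, and the actual limits should be checked against the current AWS documentation.

```python
def max_daily_decreases(first_hour_decreases=4, per_hour_after=1, hours_in_day=24):
    """Best-case number of capacity decreases in one day under the quoted rule."""
    remaining_hours = hours_in_day - 1
    return first_hour_decreases + per_hour_after * remaining_hours
```

With the defaults this gives 4 + 23 = 27, and only if a decrease lands in every single hour of the day, which is why the article calls the conditions very special.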
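The rest of the article covers their own work on finding an algorithm that adjusts capacity at a finer grain, and then reimplementing it with Lambda... In the end, the cost of their batch tables dropped to around 30% of the original:
Here is where we detected our costs for our batch tables dropping to around 30% of the initial cost.
AdRoll probably operates at a fairly large scale, so a saving of that size justifies putting quite a bit of effort into it...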