在 Fediverse 上訂閱 X (Twitter) 的內容

Mastodon 上面看到有 bridge service 可以接 X (Twitter) 的內容:

像是 elonmusk 的 X (Twitter) 可以用 @elonmusk@bird.makeup 這個位置訂閱。

程式本身是 AGPLv3,但是是用 .NET Core 寫的,如果要 self-hosted 的話得研究一下了:「bird.makeup」。

然後從 README.md 看起來是用 X (Twitter) 自家的 undocumented API,避免了正式 API 的 rate limit 問題:

Twitter API calls are not rate-limited

Evernote 限縮免費版的方案

Evernote 這次算是完全限縮了免費版的方案:「Update: Evernote Free accounts will have fifty notes and one notebook」。

從 2023/12/04 開始限縮成一個 notebook 以及 50 個 note:

From December 4, the Evernote Free experience is changing. Going forward, new and existing Free users will have a maximum of fifty notes and one notebook per account.

從今年一月被併購後,二月就把整個 Evernote 團隊砍掉,就有蠻多人有悲觀的看法了,所以想轉的應該都已經轉的差不多了...

常看到的替代方案是 Notion 以及 Obsidian,反而沒看到什麼 open source 的討論?(剛剛找了一下,是有不少選擇,但名字都沒看過...)

下載 YouTube 影片的技術限制與繞過方法

Hacker News 上看到這篇「How They Bypass YouTube Video Download Throttling」在講 YouTube 防止下載的各種方式。

透過 API 拿到的 URL 直接抓很慢,大約 40-70KB/sec:

However, attempting to download from this URL leads to really slow download:

The speed is always limited to around 40-70kB/s.

這邊需要一個 javascript 環境計算出 n,帶入後續的 request 以「證明」你是官方的網頁 client:

Since mid-2021, YouTube has included the query parameter n in the majority of file URLs. This parameter needs to be transformed using a JavaScript algorithm located in the file base.js, which is distributed with the web page. YouTube utilizes this parameter as a challenge to verify that the download originates from an “official” client. If the challenge is not resolved and n is not transformed correctly, YouTube will silently apply throttling to the video download.

The JavaScript algorithm is obfuscated and changes frequently, so it’s not practical to attempt reverse engineering to understand it. The solution is simply to download the JavaScript file, extract the algorithm code, and execute it by passing the n parameter to it. The following code accomplishes this.

但即使算出 n,也還是會限速,可以看到作者策出來大約是 4MB/sec,雖然比以前快很多了,但還是看得出來有限速。這主要是避免 client 端過度 buffer 浪費頻寬:

With this new URL containing the correctly transformed n parameter, the next step is to download the video. However, YouTube still enforces a throttling rule. This rule imposes a variable download speed limit based on the size and length of the video, aiming to provide a download time that’s approximately half the duration of the video. This aligns with the streaming nature of videos. It would be a massive waste of bandwidth for YouTube to always provide the media file as quickly as possible.

接下來的方式就是利用 Range 拆成很多個 HTTP request 打,這樣因為 buffering algorithm 在開始限速前會先全速塞資料給你,就可以用這點避開限速的問題了。

把多的 request 與處理時間都算進去後,整體大約可以到 50-70MB/sec,算是可以接受的下載速度了:

However, the average speeds typically ranged between 50-70 MB/s or 400-560 Mb/s, which is still pretty fast.

後面有一些合併處理的指令 (因為 YouTube 會把影與音分離成兩個檔案),就不是重點了...

Amazon SQS 提高 FIFO throughput 限制

在「Amazon SQS announces increased throughput quota for FIFO High Throughput mode」這邊看到 AWS 提高了 Amazon SQS 中 FIFO throughput 的限制,這本來是個常常有的公告,但讓我意外的是不同區域拉高的數量是不同的:

Amazon Simple Queue Service (SQS) announces an increased quota for a high throughput mode for FIFO queues, allowing you to process up to 9,000 transactions per second, per API action in US East (Ohio), US East (N. Virginia), US West (Oregon), Europe (Ireland), Europe (Frankfurt) regions. For Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo) regions, the throughput quota has been increased to 4,500 transactions per second, per API action. For all other regions where SQS is generally available today, the quota for high throughput mode quota has been increased to 2,400 transactions per second.

第一梯隊的 (像是 us-east-1us-west-2eu-west-1) 都是 9000 tps,而第二梯隊是 4500 tps,沒列在上面的區域是 2400 tps。

另外一個比較特別的是 Frankfurt 區居然在第一梯隊...

Internet Archive 被打

Internet Archive 更新一篇文章,說明前幾天被打掛的事情:「Let us serve you, but don’t bring us down」。

有 64 台機器 (或是 64 個 IP) 從 AWS 打了幾萬 rps 進 Internet Archive:

Tens of thousands of requests per second for our public domain OCR files were launched from 64 virtual hosts on amazon’s AWS services. (Even by web standards,10’s of thousands of requests per second is a lot.)

然後擋掉這些 IP 後恢復正常,但過了幾個小時後又換 IP 被打了:

But, another 64 addresses started the same type of activity a couple of hours later.

找了一下之前有寫過「限制流量的方式 (rate limit)」這篇,裡面提到 Figma 怎麼處理,另外以前自己搞 apache module 後面接 memcached 達到跨機器統計的作法。

就 Internet Archive 的服務來說,是應該要搞個類似的東西來擋,不然可以預期會不斷發生?

Amazon EFS 的 file lock 限制

看到「Amazon EFS now supports a larger number of concurrent file locks」這篇提到:

This Amazon EFS update increases the number of simultaneous file locks an NFS mount can acquire to 65,536 (from 8,192 previously), enabling Amazon EFS to be used for a broader set of applications that heavily leverage file locking (including message broker and distributed analytics applications).

所以 NFS 的部份先前有 8192 個的上限在,現在則是拉 65536 個了。

不過提到 message broker,應該是以前的應用利用 NFS file lock 來做一些事情?

限制流量的方式 (rate limit)

Lobsters Daily 上看到這篇 2017 年的文章,Figma 的工程師講怎麼做 rate limit:「An alternative approach to rate limiting」,只要大一點的站台就會遇到 spammer 之類的攻擊,就會希望實做自動化的機制擋住 spammer。

文章裡面提到了三種方式,第一種 (類) 提到了經典的 Token bucketLeaky bucket,這邊文章提供的演算法是讓每個使用者都會有一筆資料紀錄在 Redis 裡面 (這邊的用法可以抽換換成 Memcached),裡面記錄了最後一次的 access time 以及還剩下多少 token 可以用,接下來就可以照時間計算 token 的補充與消耗:

但這個演算法的缺點是 race condition,需要另外設計一些機制確保操作的 atomic:

不過大多數的實做就算不管 atomic 也還行 OK,只是會比較不精確一點。

第二個方法他叫做 Fixed window counters,這個方法把時間切齊為單位 (像是 60 秒為一個 window),所以可以把起點的時間也放到 key 裡面,然後 value 就是數量:

這個作法的好處就是簡單,而且 Redis 與 Memcached 都有提供 atomic 的 +1 操作。但缺點是可能會發生兩倍以上的 request,像是 5 reqs/min 的限制有可能會有連續的一分鐘內達到 10 reqs/min:

不過我覺得就 antispam 來說算是夠用了,當年 (大概是 2007 或是 2008 年?) 在 PIXNET 時用 C 寫 Apache module 就是把資料丟到 Memcached 裡面就是這樣實做的,然後每次學術網路的實驗室跑來掃站的就會自動被擋 XDDD

第三種方式他們稱作 Sliding window log,就是把每個 request 的 timestamp 都存起來,這個部份用 Redis 的資料結構會比用 Memcached 方便一些:

這個方式在控制上更精確,不過空間成本上就高很多... 這樣算是把常見的實做方式都提到了。

PyPy JIT 的改善

PyPy 這邊看到 JIT 的重大進展:「Better JIT Support for Auto-Generated Python Code」。

他們在 Tornado 上重製出來效能問題,後面也都是用這個例子在測試:

If you render a big HTML template (example) using the Tornado templating engine, the template rendering is really not any faster than CPython.

看起來上的 workaround 是在撞到 trace limit 時標記起來,之後再遇到時就可以跳進 special mode,接著處理下去避免浪費掉之前處理過的 trace:

After we have hit the trace limit and no inlining has happened so far, we mark the outermost function as a source of huge traces. The next time we trace such a function, we do so in a special mode. In that mode, hitting the trace limit behaves differently: Instead of stopping the tracer and throwing away the trace produced so far, we will use the unfinished trace to produce machine code.

效能可以看到改善很多:

看起來這個概念有打算在 3.8 的時候放進去:

The work described in this post tiny bit experimental still, but we will release it as part of the upcoming 3.8 beta release, to get some more experience with it. Please grab a 3.8 release candidate, try it out and let us know your observations, good and bad!

Django 的 template engine 不怎麼快,用 Jinja2 可能是一個方法,但既有的 project 如果有遇到 template engine 的效能問題,也許也可以翻看看 PyPy 解得如何...

EC2 的 Monitoring 提供更多關於限流的資訊

AWS 對於 EC2 的網路推出了五個新的監控指標:「Amazon EC2 announces new network performance metrics for EC2 instances」。

看起來都是跟限流有關的指標,看起來是在壓榨機器極限時會用到:

The new metrics inform customers in real time of network traffic impacted when instance allowances for inbound and outbound bandwidth, packets-per-second (PPS), connections tracked and PPS to link-local services are exceeded.

需要有最新的 ENA driver 才會提供 (看了一下現有的機器沒出現這些值 XD):

These metrics are available today in all global commercial AWS regions on instances running the latest version of the Elastic Network Adapter (ENA) driver with support for Linux, Windows ENA driver support will be available soon with version 2.2.2.0.

不另外收費:

They can be accessed from within the instance at no extra cost using simple command line tools.

這個功能在所有 AWS 商業區以及 GovCloud (US) 都已經上線:

Instance Level Network Performance Metrics is available in all AWS Commercial and GovCloud (US) Regions, with the exception of China (Beijing) and China (Ningxia).

先記錄起來就好,一般用法應該都還好...

這幾天 blog 被掃,用 nginx 的 limit_req_zone 擋...

Update:這個方法問題好像還是不少,目前先拿掉了...

這幾天 blog 被掃中單一頁面負載會比較重的頁面,結果 CPU loading 變超高,從後台可以看到常常滿載:

看了一下是都是從 Azure 上面打過來的,有好幾組都在打,IP address 每隔一段時間就會變,所以單純用 firewall 擋 IP address 的方法看起來沒用...

印象中 nginx 本身可以 rate limit,搜了一下文件可以翻到應該就是「Module ngx_http_limit_req_module」這個,就設起來暫時用這個方式擋著,大概是這樣:

limit_conn_status 429;
limit_req_status 429;
limit_req_zone $binary_remote_addr zone=myzone:10m rate=10r/m;

其中預設是傳回 5xx 系列的 service unavailable,但這邊用 429 應該更正確,從維基百科的「List of HTTP status codes」這邊可以看到不錯的說明:

429 Too Many Requests (RFC 6585)
The user has sent too many requests in a given amount of time. Intended for use with rate-limiting schemes.

然後 virtual host 的設定檔內把某個 path 放進這個 zone 保護起來,目前比較困擾的是需要 copy & paste try_filesFastCGI 相關的設定:

    location /path/subpath {
        limit_req zone=myzone;
        try_files $uri $uri/ /index.php?$args;

        include fastcgi.conf;
        fastcgi_intercept_errors on;
        fastcgi_pass php74;
    }

這樣一來就可以自動擋下這些狂抽猛送的 bot,至少在現階段應該還是有用的...

如果之後有遇到其他手法的話,再見招拆招看看要怎麼再加強 :o