FreeBSD 14.0 對比於 13.2 有顯著的效能提升

Hacker News 上看到「FreeBSD 14.0 Delivering Great Performance Uplift (phoronix.com)」這篇,原文在「FreeBSD 14.0 Is Delivering Great Performance Uplift & Running Well In Early Tests」這邊。

測試平台是 AMD EPYC™ 8534P (64 cores & 128 threads),是個今年九月才推出的 CPU,另外底層 filesystem 是跑 ZFS

翻了一輪測試的資訊,幾乎是所有的項目都有提升 (少數幾樣有些微退步),但以 Phoronix 的測試計算,整體計算起來有 18% 的提升,對於 OS 升級帶來的提升算是蠻巨大的:

Across the span of five dozen benchmarks carried out on this AMD EPYC 8534P server of FreeBSD 13.2 vs. FreeBSD 14.0, the newly-released FreeBSD 14 was on average 18% faster than its predecessor. Not bad for a simple OS upgrade. I've been seeing very healthy gains on other x86_64 servers tested so far while due to hardware availability haven't yet tried any AArch64 servers.

依照他的說明,後續會有跟其他 Linux distribution 的比較,到時候可以回來再看看:

I'll be running more FreeBSD 14.0 server benchmarks shortly along with following that up by looking at the FreeBSD 14.0 performance against the latest leading Linux distributions. In any event I'm quite happy thus far with the performance and experience in my FreeBSD 14.0 testing.

回頭看報告裡面比較特別的部分,一個是 OpenSSL 的部分有下滑一些,這點應該跟版本更新有關,在 FreeBSD 14.0 的 Release Notes 裡面有提到大版本升級:

OpenSSL has been upgraded to version 3.0.12. This is a major upgrade from version 1.1.1, which has reached its end of life. Many components of the base system use a backward-compatible API, but will be migrated later. aa7957345732 930cec16d9ee b077aed33b7b (Sponsored by The FreeBSD Foundation)

另外一點是在 Page 4 裡面,可以看到 PostgreSQL 16 的效能提升非常明顯,無論是 TPS 還是 latency 都有非常巨大的改善。

Ruby 3.3 的速度再次提升

在「Ruby 3.3's YJIT Runs Shopify's Production Code 15% Faster」這篇提到了 Ruby 3.3 的速度再次提升的消息。

Shopify 上面的測試,3.3.0-preview2 的速度已經比 3.2.2 (兩者都有開啟 YJIT) 快了 13%,而且 p50/p90/p99 都有對應的改善:

不過有提到一些要注意的點,像是記憶體的用量又會再更高 (本來開 YJIT 的時候就已經有增加了),如果是對記憶體比較敏感的環境,會需要注意這點:

Since Ruby 3.3.0-preview2 YJIT generates more code than Ruby 3.2.2 YJIT, this can result in YJIT having a higher memory overlead. We put a lot of effort into making metadata more space-efficient, but it still uses more memory than Ruby 3.2.2 YJIT. We’re looking into skipping compilation of paths that are less frequently executed.

但 server 端應該是還好 (記憶體給多一點),整體是個可以期待的方向...

Amazon EC2 推出 m7a 系列的機種

上一篇完全讀錯段落了,重寫...

Amazon EC2 推出了新的 m7a 的機種:「New – Amazon EC2 M7a General Purpose Instances Powered by 4th Gen AMD EPYC Processors」。

號稱與 m6a 相比有 50% 效能上的提升:

Today, we’re announcing the general availability of new, general purpose Amazon EC2 M7a instances, powered by the 4th Gen AMD EPYC (Genoa) processors with a maximum frequency of 3.7 GHz, which offer up to 50 percent higher performance compared to M6a instances.

不過查了一下價錢,us-east-1m6a.large 是 $0.0864/hr,m7a.large 則是 $0.11592/hr (都是 2 vCPU + 8GB RAM),漲了 34% 左右,如果計算 price performance 的話大約是 10%~15%?的確是不高所以不提 price performance,不過這次 m7a 提供了更小台的 m7a.medium (1 vCPU + 4GB RAM) 來補這塊 (m6a 最小的是 m6a.large),$0.05796/hr。

這樣看起來新的機種對於需要單核效能的應用應該會不錯?

再來是可以租到的區域,目前看起來只有歐美的傳統大區有,亞洲區還要再等等:

Amazon EC2 M7a instances are now available today in AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland).

下載 YouTube 影片的技術限制與繞過方法

Hacker News 上看到這篇「How They Bypass YouTube Video Download Throttling」在講 YouTube 防止下載的各種方式。

透過 API 拿到的 URL 直接抓很慢,大約 40-70KB/sec:

However, attempting to download from this URL leads to really slow download:

The speed is always limited to around 40-70kB/s.

這邊需要一個 javascript 環境計算出 n,帶入後續的 request 以「證明」你是官方的網頁 client:

Since mid-2021, YouTube has included the query parameter n in the majority of file URLs. This parameter needs to be transformed using a JavaScript algorithm located in the file base.js, which is distributed with the web page. YouTube utilizes this parameter as a challenge to verify that the download originates from an “official” client. If the challenge is not resolved and n is not transformed correctly, YouTube will silently apply throttling to the video download.

The JavaScript algorithm is obfuscated and changes frequently, so it’s not practical to attempt reverse engineering to understand it. The solution is simply to download the JavaScript file, extract the algorithm code, and execute it by passing the n parameter to it. The following code accomplishes this.

但即使算出 n,也還是會限速,可以看到作者策出來大約是 4MB/sec,雖然比以前快很多了,但還是看得出來有限速。這主要是避免 client 端過度 buffer 浪費頻寬:

With this new URL containing the correctly transformed n parameter, the next step is to download the video. However, YouTube still enforces a throttling rule. This rule imposes a variable download speed limit based on the size and length of the video, aiming to provide a download time that’s approximately half the duration of the video. This aligns with the streaming nature of videos. It would be a massive waste of bandwidth for YouTube to always provide the media file as quickly as possible.

接下來的方式就是利用 Range 拆成很多個 HTTP request 打,這樣因為 buffering algorithm 在開始限速前會先全速塞資料給你,就可以用這點避開限速的問題了。

把多的 request 與處理時間都算進去後,整體大約可以到 50-70MB/sec,算是可以接受的下載速度了:

However, the average speeds typically ranged between 50-70 MB/s or 400-560 Mb/s, which is still pretty fast.

後面有一些合併處理的指令 (因為 YouTube 會把影與音分離成兩個檔案),就不是重點了...

改善 Wikipedia 的 JavaScript,減少 300ms 的 blocking time

Hacker News 首頁上看到「300ms Faster: Reducing Wikipedia's Total Blocking Time」這篇,作者 Nicholas RayWikimedia Foundation 的工程師,雖然是貼在自己的 blog 上,但算是半官方性質了... 文章裡面提到了兩個改善都是跟前端 JavaScript 有關的。

作者是透過瀏覽器端的 profiling 產生火焰圖,判讀裡面哪塊是大塊的問題,然後看看有沒有機會改善。

先看最後的成果,可以看到第一個 fix 讓 blocking time 少了 200ms 左右,第二個 fix 則是少了 80ms 左右:

第一個改善是從火焰圖發現 l._enable 吃掉很多 blocking time:

作者發現是因為 find() 找出所有的連結後 (a 元素),跑去每一個連結上面綁定事件造成的效能問題:

The .on("click") call attached a click event listener to nearly every link in the content so that the corresponding section would open if the clicked link contained a hash fragment. For short articles with few links, the performance impact was negligible. But long articles like ”United States” included over 4,000 links, leading to over 200ms of execution time on low-end devices.

但這其實是 redundant code,在其他地方已經有處理了,所以解法就比較簡單,拔掉後直接少了 200ms:

Worse yet, this behavior was unnecessary. The downstream code that listened to the hashchange event already called the same method that the click event listener called. Unless the window’s location already pointed at the link’s destination, clicking a link called the checkHash method twice — once for the link click event handler and once more for the hashchange handler.

第二個改善是 initMediaViewer 吃掉的 blocking time,從 code 也可以看到問題也類似,跑一個 loop 把事件掛到所有符合條件的元素上面:

這邊的解法是 event delegation,把事件掛到上層的元素,就只需要掛一個,然後多加上檢查事件觸發的起點是不是符合條件就可以了,這樣可以大幅降低「掛」事件的成本。

這點算是常用技巧,像是 table 裡面有事件要掛到很多個 td 的時候,會改成把一個事件掛到 table 上面,另外加上判斷條件。

算是蠻標準的 profiling 過程,直接拉出真實數據來看,然後調有重大影響的部分。

llama.cpp 開始支援 GPU 了

前陣子因為重灌桌機,所以在重建許多環境... 其中一個就是 llama.cpp,連到專案頁面上時意外發現這兩個新的 feature:

OpenBLAS support
cuBLAS and CLBlast support

這代表可以用 GPU 加速了,所以就照著說明試著編一個版本測試。

編好後就跑了 7B 的 model,看起來快不少,然後改跑 13B 的 model,也可以把完整 40 個 layer 都丟進 3060 (12GB 版本) 的 GPU 上:

./main -m models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 40

從 log 可以看到 40 layers 到都 GPU 上面,吃了 7.5GB 左右:

llama.cpp: loading model from models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 7562 MB
llama_init_from_file: kv self size  =  400.00 MB

30B 的 model 我也試著丟上去跑,但只能丟 28 layers 上去 (全部是 60 layers),再多 GPU 的記憶體就撐不住了。

但能用 GPU 算是一個很大的進展,現在這版只快了一半的時間,不知道後面還有沒有 tune 的空間...

Ruff:用 Rust 寫的 Python Linter

Hacker News Daily 上看到「Astral (astral.sh)」這個,網站在「Astral: Next-gen Python tooling」。

裡面提到的 Ruff 專案是一套用 Rust 寫的 Python Linter,主打就是速度,從官網提供的 benchmark 就可以看出來差距:

因為是 Python ecosystem 的東西,安裝可以直接用 pip 裝預設編好的套件,而不需要透過 cargo 自己編 (當然你想要還是可以用 cagro 編)。

feedgen 測了一下,速度是真的快,這樣就比較不會嫌棄了... 要注意會冒出 .ruff_cache/ 目錄,.gitignore 要加一下。

然後用預設值先掃出 unused import 修掉,其他的有機會再看要怎麼改。

llama.cpp 的載入速度加速

Hacker News 上看到「Llama.cpp 30B runs with only 6GB of RAM now (github.com/ggerganov)」這個消息,原 pull request 在「Make loading weights 10-100x faster #613」這邊。

這個 PR 的作者 Justine Tunney 在 PR 上有提到他改變 model 檔案格式,以便改用 mmap(),大幅降低了需要預先讀取的時間 (因為變成 lazy-loading style),而且這也讓系統可以利用 cache page,避免了 double buffering 的問題:

This was accomplished by changing the file format so we can mmap() weights directly into memory without having to read() or copy them thereby ensuring the kernel can make its file cache pages directly accessible to our inference processes; and secondly, that the file cache pages are much less likely to get evicted (which would force loads to hit disk) because they're no longer competing with memory pages that were needlessly created by gigabytes of standard i/o.

這讓我想到在資料庫領域中,PostgreSQL 也會用 mmap() 操作,有點類似的概念。

另外 Justine Tunney 在這邊的 comment 有提到一個意外觀察到的現象,他發現實際在計算的時候用到的 model 內容意外的少:他用一個簡單的 prompt 測試,發現 20GB 的 30B model 檔案在他的 Intel 機器上實際只用到了 1.6GB 左右:

If I run 30B on my Intel machine:

[...]

As we can see, 400k page faults happen, which means only 1.6 gigabytes ((411522 * 4096) / (1024 * 1024)) of the 20 gigabyte weights file actually needed to be used.

這點他還在懷疑是不是他的修改有 bug,但目前他覺得不太像,也看不太出來:

Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse. Maybe we made some kind of rare mistake where llama.cpp is somehow evaluating 30B as though it were the 7B model. Anything's possible, however I don't think it's likely. I was pretty careful in writing this change, to compare the deterministic output of the LLaMA model, before and after the Git commit occurred. I haven't however actually found the time to reconcile the output of LLaMA C++ with something like PyTorch. It'd be great if someone could help with that, and possibly help us know why, from more a data science (rather than systems engineering perspective) why 30B is sparse.

如果不是 bug 的話,這其實冒出了一個很有趣的訊號,表示這些 model 是有可能再瘦身的?

Ruby 再引入另外一套 JIT 實做:RJIT

Hacker News Daily 上看到「RJIT #7448」這個,Ruby 上一套新的 JIT 實做。

這次的 RJIT 取代掉先前的 MJIT:

This PR replaces the current implementation of MJIT with a new JIT called "RJIT"

有些特點,其中一個是 RJIT 在 buildtime 與 runtime 都不需要 compiler,這是因為 RJIT 直接用 Ruby 實做:

RJIT uses a pure-Ruby assembler to generate native code

  • MJIT requires a C compiler at runtime. YJIT requires a Rust compiler at build time. RJIT doesn't require them.
  • This means that RJIT's warmup could be slower than YJIT, but it's still much faster than MJIT's.

另外值得注意的是,RJIT 的作者 k0kubunYJIT 的作者 Maxime Chevalier-Boisvert 都是 Shopify 的員工,可以看出 Shopify 對於 Ruby 效能的痛?決定直接自己養人改善效能。

回到 RJIT 這邊跑的測試,可以看到他是用 YJIT 的測試套件測,這也就不會太奇怪了。

跟這次取代掉的 MJIT 相比,RJIT 在 Headline 這包測試都 OK,在 Other 這包則是有來有回,而在 Micro 這包則是有不少項目輸掉 (相比於前兩者):

這樣整體看起來算是有進步,下一版 Ruby 更新應該就會有了。

由 pnpm 做出來的效能比較:npm、yarn 以及 pnpm

找資料的時候看到 pnpm 有在做幾個常見的 javascript package manager 的比較:「Benchmarks of JavaScript Package Managers」。

裡面測了九個項目,其中前八個剛好就是 cache 的有無、lockfile 的有無以及 node_modules 的有無,這三者的組合剛好八個,最後一個是測 update 的速度。

pnpm 因為用了 hard link,我先跳過去,只先關注 npmyarn 的差異。

其中幾個比較重要的項目是 (由最重要開始列):

  • with cache, with lockfile, with node_modules:這個會是一般開發的情境。
  • update:套件有更新時的時間。
  • with lockfile:這個會是第一次 clone 下來後的環境,以及 CI 的環境。

在一般開發環境下,npm 比 yarn 快一些,這點讓我比較意外,我知道 npm 這幾年一直有在改善效率,但沒想到會在 benchmark 下反超,這個是以前 yarn 很大的宣傳點,現在已經不見了。

第二個是 update (upgrade) 的部份,這邊 yarn 比 npm 快一些。

第三個因為是 CI 環境,或是第一次 clone 時的環境,慢通常可以被接受,不要差太多就好,而這邊 npm 比 yarn 快一些。

但不管是哪個,差距都不像以前那麼大了,就效率上來說,官方的 npm 已經沒有太明顯的缺點,因為效率的差異而去選擇 yarn 的理由已經不存在了。