FreeBSD 14.0 Released

The announcement for FreeBSD 14.0-RELEASE is out: "FreeBSD 14.0-RELEASE Announcement". The more complete release notes are at "FreeBSD 14.0-RELEASE Release Notes".

Starting from the official highlights, the first notable item is that the GENERIC kernel now supports 1024 cores:

FreeBSD supports up to 1024 cores on the amd64 and arm64 platforms.

A look at the commit log shows the limit was raised from 256 to 1024.

On the x86-64 side, the biggest "home" part right now is probably AMD's 7995WX (96 cores), so the old limit of 256 would still hold up; but the commit log mentions this is mainly in anticipation of even beefier machines showing up over the next few years.

The other side is servers: Intel has 8-socket configurations (see "Intel Xeon Sapphire Rapids to Scale to 4 and 8 Sockets"), and with an 8490H in every socket that's already 480 cores.

ARM can apparently also pile up cores, but I'm not familiar with that side...

Another highlighted change is that TCP's default congestion control is now CUBIC:

The default congestion control mechanism for TCP is now CUBIC.

Digging through the commit log shows the switch was from NewReno to CUBIC, which matches the Linux kernel's default.
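
You can confirm which algorithm is active through the congestion control sysctls; a quick check (the cubic module is assumed to be available, as it is in the stock GENERIC kernel):

sysctl net.inet.tcp.cc.available
sysctl net.inet.tcp.cc.algorithm
# on 14.0 the second command should report: net.inet.tcp.cc.algorithm: cubic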

The next notable item, mentioned in the release notes, is that FreeBSD 15.0 will strip out support for 32-bit platforms except armv7, which means the armv6 of the first-generation Raspberry Pi is being retired as well:

FreeBSD 15.0 is not expected to include support for 32-bit platforms other than armv7. The armv6, i386, and powerpc platforms are deprecated and will be removed. 64-bit systems will still be able to run older 32-bit binaries.

Then there are a few items I found interesting while digging through the notes myself.

First, non-root chroot:

The chroot facility supports unprivileged operation, and the chroot(8) program has a -n option to enable its use. a40cf4175c90 (Sponsored by EPSRC)
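
Going by the description, an unprivileged user should be able to do something like the following (a minimal sketch: the directory is hypothetical, it still needs the binaries you want to run inside it, and the security.bsd.unprivileged_chroot sysctl may need to be enabled first):

# as a regular user, no root required
chroot -n ~/jailroot /bin/sh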

Then FIDO/U2F support in OpenSSH has been switched on:

The use of FIDO/U2F hardware authenticators has been enabled in ssh, using the new public key types ecdsa-sk and ed25519-sk, along with corresponding certificate types. FIDO/U2F support is described in https://www.openssh.com/txt/release-8.2. e9a994639b2a (Sponsored by The FreeBSD Foundation)
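
Generating a hardware-backed key works the same as on any platform with OpenSSH 8.2+; with a FIDO2 token plugged in:

# creates id_ed25519_sk / id_ed25519_sk.pub; touch the token to confirm
ssh-keygen -t ed25519-sk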

ASLR is now enabled by default:

Address Space Layout Randomization (ASLR) is enabled for 64-bit executables by default. It can be disabled as needed if applications fail unexpectedly, for example with segmentation faults. To disable for a single invocation, use the proccontrol(1) command: proccontrol -m aslr -s disable command. To disable ASLR for all invocations of a binary, use the elfctl(1) command: elfctl -e +noaslr file. Problems should be reported via the problem reporting system, https://bugs.freebsd.org, or posting to the freebsd-stable@FreeBSD.org mailing list. b014e0f15bc7 (Sponsored by Stormshield)
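
The two opt-out mechanisms from the quote, written out (myapp is a hypothetical binary here):

# disable ASLR for a single invocation
proccontrol -m aslr -s disable ./myapp
# disable ASLR for every invocation of a binary, via an ELF feature note
elfctl -e +noaslr ./myapp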

Then the WireGuard support that got roasted so badly before has been brought back (see the earlier post "FreeBSD & pfSense 上的 WireGuard 問題"):

The kernel wg(4) WireGuard driver has been reintegrated; it provides Virtual Private Network (VPN) interfaces using the WireGuard protocol. 744bfb213144 (Sponsored by Rubicon Communications, LLC ("Netgate") and The FreeBSD Foundation)
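
Configuration goes through ifconfig(8) together with the standard wg(8) utility; a minimal sketch, assuming wireguard-tools is installed from packages and with placeholder key/address values:

ifconfig wg0 create
wg set wg0 listen-port 51820 private-key /usr/local/etc/wireguard/server.key
ifconfig wg0 inet 10.0.0.1/24 up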

Then there's the Netflix-sponsored kTLS gaining TLS 1.3 support:

KTLS (the kernel TLS implementation) has added receive offload support for TLS 1.3. Receive offload is now supported for TLS 1.1 through 1.3; send offload is supported for TLS 1.0 through 1.3. 05a1d0f5d7ac (Sponsored by Netflix)
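
As far as I know KTLS still has to be switched on before userland TLS libraries can offload to it; the knob should be this sysctl (my assumption; check ktls(4) for the details on your setup):

sysctl kern.ipc.tls.enable=1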

Then root's long-standing default shell on FreeBSD has been changed from /bin/csh to /bin/sh:

The default shell for the root user is now sh(1), which has many new features for interactive use. d410b585b6f0

The default MTA is now dma (DragonFly Mail Agent); the name, plus a quick look at the manpage, confirms it was ported over from DragonFly BSD:

The default mail transport agent (MTA) is now the Dragonfly Mail Agent (dma(8)) rather than sendmail(8). Configuration of the MTA is done via mailer.conf(5). sendmail(8) and its configuration remain available. a67b925ff3e5
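
mailer.conf(5) is just a table mapping the sendmail-compatible entry points to actual binaries, so the new default presumably looks something like this (a sketch based on the description; the /usr/libexec/dma path is my assumption about the base-system layout):

# /etc/mail/mailer.conf
sendmail    /usr/libexec/dma
mailq       /usr/libexec/dma
newaliases  /usr/libexec/dma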

Then portsnap has been removed; the recommendation now is to just pull the tree with git. It has served its purpose and can retire:

The portsnap(8) utility has been removed. Users are encouraged to fetch the ports tree by using pkg install git and then git clone https://git.FreeBSD.org/ports.git /usr/ports. df53ae0fdd98
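
Spelled out, the replacement workflow is just the commands from the quote, plus a plain git pull for later updates:

pkg install git
git clone https://git.FreeBSD.org/ports.git /usr/ports
# afterwards, update with:
git -C /usr/ports pull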

And mergemaster has been replaced by etcupdate:

mergemaster(8) has been deprecated. Its replacement is etcupdate(8). 398b12691b4f (Sponsored by The FreeBSD Foundation)
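
For anyone used to the mergemaster routine, the etcupdate(8) flow after a source upgrade is roughly this (a sketch; run it after installworld):

etcupdate           # three-way merge of /etc against the new source tree
etcupdate resolve   # interactively resolve any leftover conflicts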

Then there's tarfs support, which also works with zstd:

The tarfs(5) file system has been added, which is backed by POSIX tar archives optionally compressed with zstd(1). 69d94f4c7608 (Sponsored by Juniper Networks, Inc.) (Sponsored by Klara, Inc.)
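
So mounting an archive directly as a read-only file system should look something like this (a sketch; I haven't dug into the exact options, see tarfs(5)):

mount -t tarfs ./snapshot.tar.zst /mnt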

It's been a long time since I last went through FreeBSD release notes...

Cloudflare Workers Launches a New Billing Model: Charging by CPU Time

Cloudflare's Workers product (the serverless line) has introduced a billing model based on CPU time: "New Workers pricing — never pay to wait on I/O again".

Until now everyone billed by wall time, which is a bad deal for applications that spend a long time blocked on I/O. Cloudflare's switch to billing by CPU time definitely caught my eye...
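
A quick worked example of the difference (the numbers are made up for illustration, not Cloudflare's actual rates): take a request that waits 500 ms on an upstream API but only spends 5 ms executing.

wall-time billing: 500 ms + 5 ms = 505 ms billable per request
CPU-time billing:  5 ms billable per request (~100x less billable time)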

This is the old pricing:

This is the new one:

Comparing the two charts, there's one thing I'm not sure about: is memory usage no longer part of the bill? The data on the Pricing page has been updated with the new "Standard" plan, but it still doesn't seem to mention memory?

It doesn't take effect immediately, though; starting at the end of this month, 2023/10/31, you can opt in to the new plan:

Starting October 31, 2023, you will have the option to opt in individual Workers and Pages Functions projects on your account to new pricing, and newly created projects will default to new pricing. You’ll be able to estimate how much new pricing will cost in the Cloudflare dashboard. For the majority of current applications, new pricing is the same or less expensive than the previous Bundled and Unbound pricing plans.

And next year, on 2024/03/01, everyone will be forcibly migrated to the new plan:

If you’re on our Workers Paid plan, you will have until March 1, 2024 to switch to the new pricing on your own, after which all of your projects will be automatically migrated to new pricing. If you’re an Enterprise customer, any contract renewals after March 1, 2024, will use the new pricing.

Billing by CPU time is indeed much better, but I don't know whether this billing method hides any other landmines...

Cavium (Acquired by Marvell) Listed as a SIGINT "Enabled" Vendor in the Snowden Leak

The title may be a bit hard to parse. Put simply, something troubling turned up in the material Snowden leaked back in 2013: Cavium (now Marvell) CPUs may have had backdoors planted in them, and their products are used in "security products" from a whole pile of vendors.

This comes from a post on X (Twitter):

The passage can be found in the 2022 thesis "Communication in a world of pervasive surveillance", in note 21 on page 71 as numbered in the text (page 90 of the PDF):

While working on documents in the Snowden archive the thesis author learned that an American fabless semiconductor CPU vendor named Cavium is listed as a successful SIGINT "enabled" CPU vendor. By chance this was the same CPU present in the thesis author’s Internet router (UniFi USG3). The entire Snowden archive should be open for academic researchers to better understand more of the history of such behavior.

Ubiquiti takes a direct hit...

Meanwhile, the Hacker News discussion "Snowden leak: Cavium networking hardware may contain NSA backdoor (twitter.com/matthew_d_green)" makes things even more headache-inducing; for instance, Cavium once put out a press release noting they were an AWS CloudHSM supplier: "Cavium's LiquidSecurity® HSM Enables Hybrid Cloud Users to Synchronize Keys Between AWS CloudHSM and Private Clouds".

And a user confirmed seeing Cavium entries in their logs:

Ayup. We use AWS CloudHSM to hold our private signing keys for deploying field upgrades to our hardware. And when we break the CI scripts I see Cavium in the AWS logs.

Now I gotta take this to our security team and figure out what to do.

Of all things it's CloudHSM, something that architecturally sits just about at the root of trust...

Intel's Plan for Running the AVX-512 Instruction Set on E-cores: AVX10.2

Saw "Intel AVX10.2 ISA to enable AVX-512 capabilities on E-cores", which points to Intel's technical paper "The Converged Vector ISA: Intel® Advanced Vector Extensions 10" laying out Intel's follow-up plans for AVX-512.

The key part is this chart, which shows E-core support planned for AVX10.2:

We still have to wait, though; here it's only labeled as "future":

The current rumor is that AVX10.1 will appear on Xeon in 2024 or 2025:

Intel says that version 1 of the AVX10 vector ISA (AVX10.1) will first be implemented on Intel Xeon “Granite Rapids” processors that, according to some media reports, are expected to launch by 2024 or 2025, so it will likely take a long while before AVX10.2 is implemented on processors with E-cores.

But AVX10.1 still doesn't include the ability to run AVX-512 on E-cores, so AVX10.2 should be even further out...

Amazon EC2 Launches the m7a Instance Family

I completely misread a paragraph in the previous post, so here's a rewrite...

Amazon EC2 has launched the new m7a instances: "New – Amazon EC2 M7a General Purpose Instances Powered by 4th Gen AMD EPYC Processors".

They're claimed to offer up to 50% higher performance than m6a:

Today, we’re announcing the general availability of new, general purpose Amazon EC2 M7a instances, powered by the 4th Gen AMD EPYC (Genoa) processors with a maximum frequency of 3.7 GHz, which offer up to 50 percent higher performance compared to M6a instances.

Checking prices, though: in us-east-1, m6a.large is $0.0864/hr while m7a.large is $0.11592/hr (both 2 vCPU + 8GB RAM), roughly a 34% increase. Working out the price performance gives about 10%~15% (rough arithmetic below)? That's low enough to explain why they don't advertise price performance. On the other hand, m7a adds a smaller m7a.medium (1 vCPU + 4GB RAM) at $0.05796/hr to fill in the low end (the smallest m6a is m6a.large).
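
The rough arithmetic behind those numbers:

0.11592 / 0.0864 ≈ 1.342   (price up ~34%)
1.50 / 1.342 ≈ 1.118       (price performance up ~12%, if the full +50% holds)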

So the new family looks like it should be good for applications that need single-core performance?

As for regions, for now only the traditional big US/EU regions have them; Asia will have to wait a bit longer:

Amazon EC2 M7a instances are now available today in AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland).

Amazon EC2 Launches New-Generation Intel CPUs, Plus a Strange Flex Product

Jeff Barr announced Amazon EC2's new Intel CPU product line: "New Seventh-Generation General Purpose Amazon EC2 Instances (M7i-Flex and M7i)".

Let's start with m7i, which is the easier one to understand: it's Intel's new CPU, and the announcement rather obliquely claims only that it's "15% faster than Intel CPUs at other cloud providers":

Today we are launching Amazon Elastic Compute Cloud (Amazon EC2) M7i-Flex and M7i instances powered by custom 4th generation Intel Xeon Scalable processors available only on AWS, that offer the best performance among comparable Intel processors in the cloud – up to 15% faster than Intel processors utilized by other cloud providers.

Compared with the previous-generation Intel instances (presumably m6i?), it's claimed to have 15% better price performance:

The M7i instances are available in nine sizes (with two size of bare metal instances in the works), and offer 15% better price/performance than the previous generation of Intel-powered instances.

The most interesting (though not necessarily useful) part this time is the new m7i-flex, claimed to deliver 19% better price performance than m6i "for many workloads":

M7i-Flex instances are available in the five most common sizes, and are designed to give you up to 19% better price/performance than M6i instances for many workloads.

Here it's mentioned that m7i-flex is m7i with 5% knocked off the price:

The M7i-Flex instances are a lower-cost variant of the M7i instances, with 5% better price/performance and 5% lower prices. They are great for applications that don’t fully utilize all compute resources. The M7i-Flex instances deliver a baseline of 40% CPU performance, and can scale up to full CPU performance 95% of the time.

The "General purpose instances" page is clearer: the time you can run above 40% CPU is capped at 95% of a 24-hour window:

For times when workloads need more performance, M7i-flex instances provide the ability to exceed baseline CPU and deliver up to 100 percent CPU for 95 percent of the time over a 24-hour window.

This design is pretty subtle. Setting aside the ARM-based t4g next door for now (it's at least not a drop-in replacement)...

m7i.large (2 vCPU + 8GB RAM) is $0.1008/hr in us-east-1, and you can use 100% of its CPU with no restrictions at all.

m7i-flex.large is $0.09576/hr, exactly 5% cheaper, at the cost of having to stay below 40% CPU for 5% of the time.
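
That 5% figure is exact:

0.1008 × 0.95 = 0.09576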

Comparing with the t3 family: t3.large also has 2 vCPU + 8GB RAM, but its built-in CPU baseline is 30%, at $0.0832/hr.

From this you can see that most small applications will still be thrown onto t3 or even t3a.

Only applications whose average load over 24 hours exceeds 40%, but which aren't above 40% around the clock — i.e., the ones that would currently run on m6a or m6i — would even consider evaluating m7i-flex.

Not to mention that on AWS, m6a is billed 10% less than the Intel m6i. After running the numbers, this product's positioning is very delicate: there should be places where it fits, but no matter how you count them, there aren't many... It feels a bit like a product AWS delivered to placate Intel?

Following that logic, though, this approach is really a billing-based scheme that has little to do with the technology itself, so if it can be done for Intel, AMD and ARM versions should show up sooner or later?

It looks like an extension of the t family, but it may be worth waiting to see whether something similar shows up on the AMD and ARM product lines?

OpenLLM, a Python Package That Wraps Open Source LLMs

Saw "OpenLLM (github.com/bentoml)" on Hacker News; it's software written in Python that wraps open source LLMs up for you to use.

Gave it a quick test on a Mac first; the packaging looks good, and you can hit it over an HTTP API.

Install it with pip first:

pip install openllm

Then you can bring the server up. Following the example, run dolly-v2; the first run takes a while because the model needs to be downloaded:

openllm start dolly-v2

After that you can just open http://127.0.0.1:3000/ to interact with it, or use the command line, for example following the official sample:

openllm query --endpoint http://127.0.0.1:3000 "What is the meaning of life?"
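
If you'd rather hit the HTTP API directly, something along these lines should work (a sketch: the /v1/generate endpoint and payload shape are my assumptions from poking at the web UI, so check the server's own docs page for the exact path):

curl -X POST http://127.0.0.1:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "What is the meaning of life?"}'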

The most obvious problem so far is that CPU mode runs single-threaded, so while it works, it's quite slow... I'll test the GPU side later.

AWS's New m7a Claims 50% More Performance Than m6a?

Saw AWS's claim in "Introducing Amazon EC2 M7a instances (Preview)" that m7a will be 50% faster than m6a:

These instances deliver up to 50% greater performance on average compared to M6a instances.

It's still in preview and you have to apply for access, so the real performance is unknown for now. On top of that, I haven't found pricing yet... but if the price doesn't climb too much, a quick calculation suggests it might catch up with the ARM-based m7g?

It also makes it worth watching whether a t4a will show up.

llama.cpp Now Has a Fully GPU-Accelerated Version

Saw "Llama.cpp: Full CUDA GPU Acceleration (github.com/ggerganov)" on the Hacker News front page; the corresponding original page is "CUDA full GPU acceleration, KV cache in VRAM #1827".

It explains that llama.cpp's earlier GPU acceleration still did quite a few things on the CPU; this change implements GPU versions of all the tensor operations ggml currently supports:

This PR adds GPU acceleration for all remaining ggml tensors that didn't yet have it. Especially for long generations this makes a large difference because the KV cache is still CPU only on master and gets larger as the context fills up.

Plenty of people have posted different benchmark results. Note that this isn't about moving CPU work wholesale onto the GPU; it's moving over the parts that had been left on the CPU because they were relatively light, so it won't be an order-of-magnitude speedup, but the improvement already looks pretty decent:

Early attempt this morning we're getting ~2.5-2.8x perf increase on 4090s and about 1.8-2x on 3090Ti.
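
For reference, trying this out at the time meant building with the cuBLAS backend and offloading all layers to the GPU; a sketch using the flags llama.cpp had in that era (the model path is a placeholder):

# build with CUDA (cuBLAS) support
make LLAMA_CUBLAS=1
# -ngl / --n-gpu-layers: how many layers to offload; 99 covers all of a 7B model
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 99 -p "Hello"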

As for Falcon... there doesn't seem to be much decent progress yet XD

Georgi Gerganov Founds GGML, the Company

Saw Georgi Gerganov's plan to start a company on the Hacker News front page: "GGML – AI at the Edge (ggml.ai)"; the official site is at "GGML - AI at the edge".

As Georgi Gerganov mentions, projects like llama.cpp started out as his side projects and unexpectedly blew up:

He also mentioned that Nat Friedman and Daniel Gross lent a hand:

The official site mentions it's pre-seed funding:

ggml.ai is a company founded by Georgi Gerganov to support the development of ggml. Nat Friedman and Daniel Gross provided the pre-seed funding.

Looking back now, llama.cpp originally took off mainly because a CPU could run LLaMA 7B, and running it on CPU wasn't actually that slow.

It then attracted a lot of contributors, which brought plenty of optimizations (like the mmap work covered in "llama.cpp 的載入速度加速", which cuts load time and lets multiple processes reuse the cache), and GPU support followed...

But it's not clear what his long-term plan is now that he's started a company...?