Intel's AVX-512-accelerated sorting for NumPy has been merged into mainline

The Intel AVX-512-accelerated sorting implementation for NumPy has been merged into mainline: 「Intel Publishes Blazing Fast AVX-512 Sorting Library, Numpy Switching To It For 10~17x Faster Sorts」.

The GitHub PR is 「ENH: Vectorize quicksort for 16-bit and 64-bit dtype using AVX512 #22315」, where you can see the relevant comments:

This patch adds AVX512 based 64-bit on AVX512-SKX and 16-bit sorting on AVX512-ICL. All the AVX512 sorting code has been reformatted as a separate header files and put in a separate folder. The AVX512 64-bit sorting is nearly 10x faster and AVX512 16-bit sorting is nearly 16x faster when compared to std::sort. Still working on running NumPy benchmarks to get exact benchmark numbers

16-bit int sped up by 17x and float64 by nearly 10x for random arrays. Benchmarked on a 11th Gen Tigerlake i7-1165G7.
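If you want a rough feel for the speedup on your own machine, a minimal benchmark sketch like the following should work with any recent NumPy (the array size of 1,000,000 elements is just an arbitrary choice here, not a figure from the PR):

import time
import numpy as np

rng = np.random.default_rng(0)
arrays = {
    "int16": rng.integers(-2**15, 2**15, size=1_000_000, dtype=np.int16),
    "float64": rng.random(1_000_000),
}

for name, a in arrays.items():
    t0 = time.perf_counter()
    np.sort(a, kind="quicksort")  # the vectorized path is used automatically when the CPU supports it
    print(f"{name}: {time.perf_counter() - t0:.4f}s")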

The somewhat "interesting" part is that AVX-512 has been removed from Intel's new consumer CPUs and is only kept on the server and workstation parts, while AMD's Zen 4 has gone all in on AVX-512 support...

Over at 「Intel Publishes Fast AVX-512 Sorting Library, 10~17x Faster Sorts in NumPy (phoronix.com)」 people of course also point out that accelerating with the far more widespread AVX2 (256 bits wide) should bring a big improvement too, since almost every CPU from the past decade has AVX2... but there doesn't seem to be any in-depth discussion of it?

NIST updates its SHA-1 retirement plan

NIST's new retirement plan for SHA-1 is out: 「NIST Retires SHA-1 Cryptographic Algorithm」.

Back in 2004, NIST planned to phase out SHA-1 by 2010; the announcement from that time can be seen in 「NIST Brief Comments on Recent Cryptanalytic Attacks on Secure Hashing Functions and the Continued Security Provided by SHA-1」:

The results presented so far on SHA-1 do not call its security into question. However, due to advances in technology, NIST plans to phase out of SHA-1 in favor of the larger and stronger hash functions (SHA-224, SHA-256, SHA-384 and SHA-512) by 2010.

But apparently it wasn't mandatory back then, so things just kept getting delayed. Along the way came the SHA-1 collision demonstrated by Google and CWI Amsterdam in 2017: 「Google 與 CWI Amsterdam 合作,找到 SHA-1 第一個 collision」.

And then the progress and analysis in 2020 showing that chosen-prefix collisions had reached the practical range: 「SHA-1 的 chosen-prefix collision 低於 2^64 了...」.

Now NIST has finally remembered to update the phase-out plan; the latest plan is to retire SHA-1 by the end of 2030:

As today’s increasingly powerful computers are able to attack the algorithm, NIST is announcing that SHA-1 should be phased out by Dec. 31, 2030, in favor of the more secure SHA-2 and SHA-3 groups of algorithms.
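For application code the switch is usually mechanical; a quick illustration with Python's hashlib, where any member of the SHA-2 or SHA-3 families works as the drop-in for new designs (the sample data is obviously just a placeholder):

import hashlib

data = b"hello world"
# Legacy: SHA-1, to be phased out by the end of 2030
print(hashlib.sha1(data).hexdigest())
# The replacements NIST points to: the SHA-2 and SHA-3 families
print(hashlib.sha256(data).hexdigest())
print(hashlib.sha3_256(data).hexdigest())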

This time there are some mandatory requirements, including on the procurement side:

“Modules that still use SHA-1 after 2030 will not be permitted for purchase by the federal government,” Celi said.

But 2030 still sounds a bit slow...

The speed difference between a 14KB and a 15KB web page

Saw 「Why your website should be under 14kB in size」 on Hacker News, with the discussion at 「A 14kb page can load much faster than a 15kb page (endtimes.dev)」. It's about why the speed difference between a 14KB page and a 15KB page is much bigger than the one between 15KB and 16KB:

What is surprising is that a 14kB page can load much faster than a 15kB page — maybe 612ms faster — while the difference between a 15kB and a 16kB page is trivial.

The cause is TCP slow start:

This is because of the TCP slow start algorithm.

And on the web side, most current implementations start TCP slow start with an initial window of 10 packets:

Most web servers TCP slow start algorithm starts by sending 10 TCP packets.

Combine that with 1500 bytes per packet and the header overhead, and you land at roughly 14KB:

The maximum size of a TCP packet is 1500 bytes.

This maximum is not set by the TCP specification, it comes from the ethernet standard

Each TCP packet uses 40 bytes in its header — 16 bytes for IP and an additional 24 bytes for TCP

That leaves 1460 bytes per TCP packet. 10 x 1460 = 14600 bytes or roughly 14kB!
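The arithmetic is easy to redo; a small sketch using the article's 1460-byte payload figure and an initial window of 10 segments, and assuming the window roughly doubles each round trip as in classic slow start:

MSS = 1460          # payload bytes per TCP segment (1500 MTU minus the headers above)
INIT_CWND = 10      # initial congestion window, in segments

cwnd = INIT_CWND
for rtt in range(1, 5):
    print(f"round trip {rtt}: up to {cwnd * MSS} bytes ({cwnd * MSS / 1024:.1f} KB)")
    cwnd *= 2       # slow start: the window roughly doubles every RTT

The first round trip comes out to 14600 bytes, which is where the 14KB threshold comes from.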

A similar design also shows up in HTTP/3 (from 「QUIC Loss Detection and Congestion Control」):

Sending multiple packets into the network without any delay between them creates a packet burst that might cause short-term congestion and losses. Implementations MUST either use pacing or limit such bursts to the initial congestion window, which is recommended to be the minimum of 10 * max_datagram_size and max(2 * max_datagram_size, 14720), where max_datagram_size is the current maximum size of a datagram for the connection, not including UDP or IP overhead.
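Translating that recommendation into numbers (a small sketch; the 1200 and 1472 datagram sizes are just example values, not taken from the RFC):

def quic_initial_window(max_datagram_size: int) -> int:
    # RFC 9002's recommended initial congestion window, in bytes
    return min(10 * max_datagram_size, max(2 * max_datagram_size, 14720))

for size in (1200, 1472):
    print(size, quic_initial_window(size))

With a typical ~1472-byte datagram this again comes out to 14720 bytes, i.e. the same 14KB ballpark.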

A nice bit of trivia... though it doesn't help with today's bloated pages, and considering that larger sites may push many requests through a single TCP connection, they've long since passed the slow start threshold anyway.

Post-quantum KEMs SIDH/SIKE confirmed dead

This seems to be the big news in cryptography over the past few days: SIDH and SIKE are confirmed to have a serious problem: 「SIKE Broken」, with the paper available at 「An efficient key recovery attack on SIDH (preliminary version)」.

The result is a key recovery attack, about as brutal as it gets: it recovers the key outright.

SIKE also happens to be the target Cloudflare used earlier when explaining Hertzbleed: 「Cloudflare 上的 Hertzbleed 解釋」. By the looks of it, there's no point even continuing to work on a patch now...

Among the targets attacked in the paper, the first are $IKEp182 and $IKEp217 as defined in Microsoft's SIKE challenges, which were solved on a single core in about four and six minutes respectively:

Ran on a single core, the appended Magma code breaks the Microsoft SIKE challenges $IKEp182 and $IKEp217 in about 4 minutes and 6 minutes, respectively.

Next come the four parameter sets in the NIST standardization process, SIKEp434, SIKEp503, SIKEp610, and SIKEp751, which were also solved in very little time:

A run on the SIKEp434 parameters, previously believed to meet NIST’s quantum security level 1, took about 62 minutes, again on a single core.

We also ran the code on random instances of SIKEp503 (level 2), SIKEp610 (level 3) and SIKEp751 (level 5), which took about 2h19m, 8h15m and 20h37m, respectively.

In the Ars Technica interview 「Post-quantum encryption contender is taken out by single-core PC and 1 hour」, SIKE co-inventor David Jao was asked for his take; his main view is that cryptographers don't understand the "weapons" from the mathematics world well enough, which is what led to this situation:

It's true that the attack uses mathematics which was published in the 1990s and 2000s. In a sense, the attack doesn't require new mathematics; it could have been noticed at any time. One unexpected facet of the attack is that it uses genus 2 curves to attack elliptic curves (which are genus 1 curves). A connection between the two types of curves is quite unexpected. To give an example illustrating what I mean, for decades people have been trying to attack regular elliptic curve cryptography, including some who have tried using approaches based on genus 2 curves. None of these attempts has succeeded. So for this attempt to succeed in the realm of isogenies is an unexpected development.

In general there is a lot of deep mathematics which has been published in the mathematical literature but which is not well understood by cryptographers. I lump myself into the category of those many researchers who work in cryptography but do not understand as much mathematics as we really should. So sometimes all it takes is someone who recognizes the applicability of existing theoretical math to these new cryptosystems. That is what happened here.

So that leaves only three candidates in the fourth round...

NIST selects four post-quantum cryptography algorithms

NIST (NSA) has selected four post-quantum cryptography algorithms (algorithms that can resist quantum computers): 「NIST Announces First Four Quantum-Resistant Cryptographic Algorithms」.

The four algorithms are:

  • CRYSTALS-Kyber: asymmetric encryption.
  • CRYSTALS-Dilithium: digital signatures.
  • FALCON: digital signatures.
  • SPHINCS+: digital signatures.

There's no general-purpose asymmetric encryption/decryption algorithm in this batch, though (Kyber is a KEM, used for key establishment)...
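If you want to play with the KEM flow that Kyber provides, the liboqs-python bindings expose it; a rough sketch from memory follows, and the module and method names here are assumptions, so check the liboqs-python README before relying on them:

import oqs  # liboqs-python; API names below are from memory and may differ

# Client and server each hold their own KEM instance
with oqs.KeyEncapsulation("Kyber512") as client, oqs.KeyEncapsulation("Kyber512") as server:
    public_key = client.generate_keypair()
    # Server encapsulates a shared secret against the client's public key
    ciphertext, shared_secret_server = server.encap_secret(public_key)
    # Client decapsulates and should recover the same secret
    shared_secret_client = client.decap_secret(ciphertext)
    assert shared_secret_client == shared_secret_server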

Skimming the Hacker News discussion, sure enough a bunch of people are arguing about whether NIST can be trusted: 「NIST Announces First Four Quantum-Resistant Cryptographic Algorithms (nist.gov)」.

Also, the name Kyber reportedly comes from Star Wars, while Dilithium comes from Star Trek; that's remarkably even-handed XDDD

The power of Perl's regular expressions: NP-complete

This one leans a bit toward CS theory...

When you learn formal languages in school, three things come up together: grammar, language, and automaton, as laid out in the Chomsky hierarchy table on Wikipedia.

There you can see the classic regular expression landing in the RG/RL/FSM row.

A few days ago I saw gugod's post 「[Perl] 以正規表示式來定義文法規則」, which tries to build a recursive descent parser with Perl's regular expressions (perlre).

A recursive descent parser can be treated as handling a subset of CFGs; the language class corresponding to CFG is CFL, and the corresponding automaton is the PDA.

We already know that since perlre supports a pile of odd features (like backreferences and recursive patterns), the languages it can accept go beyond RL, but I was curious how far it can actually go.

Some searching turned up an analysis of PCRE (a library compatible with Perl's regular expression syntax): 「Which languages do Perl-compatible regular expressions recognize?」.

Someone there mentions the article 「The true power of regular expressions」, which gives a polynomial-time reduction from 3SAT into PCRE matching, proving that PCRE matching is NP-hard; it's also easy to confirm that it's in NP, so the conditions for NP-completeness are met...
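A toy version of the idea is easy to write down (this is my own simplified encoding for illustration, not the construction from the article): each variable becomes a capture group that matches either "x" (true) or the empty string (false), and each clause becomes an alternation that can only consume its "x;" segment when at least one of its literals is satisfied. Python's re supports the backreferences this needs:

import re

# Formula: (v1 or not v2 or v3) and (not v1 or v2 or not v3)
# Each "x-" segment lets group i capture "x" (true) or "" (false);
# each clause segment "x;" can only be consumed if the clause is satisfied:
#   positive literal vi  ->  \i;   (needs group i == "x")
#   negative literal vi  ->  \ix;  (needs group i == "")
subject = "x-x-x-" + "x;x;"
pattern = r"^(x?)x?-(x?)x?-(x?)x?-(?:\1;|\2x;|\3;)(?:\1x;|\2;|\3x;)$"
print(bool(re.match(pattern, subject)))   # True: the formula is satisfiable

# An unsatisfiable formula: (v1) and (not v1)
print(bool(re.match(r"^(x?)x?-(?:\1;)(?:\1x;)$", "x-x;x;")))  # False

The regex engine's backtracking effectively enumerates truth assignments, which is also why the worst-case running time blows up exponentially.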

I had always assumed PCRE was only at the CFG/CFL/PDA level and didn't expect it to be this powerful; NP-completeness means a huge class of problems can, in principle, be encoded as PCRE patterns and "run" that way... XD

Comparing lossless image compression formats

Saw 「What’s the best lossless image format? Comparing PNG, WebP, AVIF, and JPEG XL」 on Hacker News, comparing lossless image compression. The Hacker News discussion is also worth a look: 「What’s the best lossless image format? (siipo.la)」.

The article is a bit old (July 2021), but it should still hold up... The author's main concern appears to be serving bandwidth; in that scenario, photographic images are usually served lossy (e.g. JPEG) while synthetic images are served losslessly (e.g. PNG), so the lossless comparison is done on a dataset of synthetic images, which the author collected from Dribbble:

What I ended up with was downloading a set of images from Dribbble, a portfolio site for designers.

The upshot: given current browser support, lossless WebP turns out to be a very good choice; the files are fairly small, and support in the Apple ecosystem has arrived as well.
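For reference, converting a PNG to lossless WebP is a one-liner with Pillow; a small sketch for comparing sizes (the file names are placeholders):

import os
from PIL import Image

src = "input.png"   # placeholder path
img = Image.open(src)
img.save("output.webp", lossless=True)   # Pillow's WebP writer, lossless mode
img.save("output.png", optimize=True)    # re-encoded PNG for comparison

for path in (src, "output.png", "output.webp"):
    print(path, os.path.getsize(path), "bytes")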

If browsers aren't a consideration, JPEG XL is also an option, though a shadow has fallen over the supposedly royalty-free part: 「Alarm raised after Microsoft wins data-encoding patent」. Anyone using it now has to pay attention to the patent issue...

Cloudflare launches Cache Reserve, letting you buy cache space

Cloudflare announced a big batch of things over the past few days; one of them is Cache Reserve: 「Introducing Cache Reserve: massively extending Cloudflare’s cache」.

In the normal setup, an LRU algorithm decides what to evict when Cloudflare's cache is full:

We do eviction based on an algorithm called “least recently used” or LRU. This means that the least-requested content can be evicted from cache first to make space for more popular content when storage space is full.
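LRU itself is simple enough to sketch in a few lines; a minimal illustration with OrderedDict (this is just the textbook policy, obviously not Cloudflare's implementation):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry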

Cache Reserve is essentially buying your own cache space; the way it works is that you pay the R2 storage cost:

Cache Reserve is a large, persistent data store that is implemented on top of R2.

That way content can be kept for exactly as long as HTTP headers like Cache-Control say, and you don't have to worry about it getting purged. On pricing, first there's the R2 portion:

The Cache Reserve Plan will mimic the low cost of R2. Storage will be $0.015 per GB per month and operations will be $0.36 per million reads, and $4.50 per million writes.

There's also a Cache Reserve-specific portion that hasn't been announced yet:

(Cache Reserve pricing page will be out soon)
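To get a feel for the announced numbers, a back-of-the-envelope calculation (the Cache Reserve-specific fee isn't out yet, so this only covers the R2-style storage and operation charges; the 1 TB and request counts are made-up example figures):

STORAGE_PER_GB_MONTH = 0.015   # USD, from the announcement
READS_PER_MILLION = 0.36
WRITES_PER_MILLION = 4.50

stored_gb = 1024            # example: ~1 TB kept in Cache Reserve
reads = 50_000_000          # example: reads from the reserve per month
writes = 5_000_000          # example: objects written into the reserve per month

cost = (stored_gb * STORAGE_PER_GB_MONTH
        + reads / 1_000_000 * READS_PER_MILLION
        + writes / 1_000_000 * WRITES_PER_MILLION)
print(f"~${cost:.2f}/month")   # ~$55.86/month with these made-up numbers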

It's an option for users who really want to squeeze out every last bit of hit rate. Another thought is that live-streaming protocols (like HLS) could probably use this to cut the load on the origin server?

Tor 0.4.7.7 supports congestion control

Tor now supports congestion control within the protocol for the first time: 「Congestion Control Arrives in Tor 0.4.7-stable!」.

The new feature brings a performance boost:

Tor has released 0.4.7.7, the first stable Tor release with support for congestion control. Congestion control will eliminate the speed limit of current Tor, as well as reduce latency by minimizing queue lengths at relays. It will result in significant performance improvements in Tor, as well as increased utilization of our network capacity.

The reason they can't simply rely on packet loss and let the TCP network stack handle congestion control is that it would introduce a side channel:

Crucially, we rejected mechanisms to provide congestion control by allowing packet drops, due to the ability to introduce end-to-end side channels in the packet drop pattern.

So Tor has to implement its own congestion control algorithm; the one chosen is Tor-Vegas, which builds on Vegas, and in their experiments the exit nodes in Germany and Hong Kong show a large efficiency improvement.
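For reference, the core Vegas idea is to compare the expected and actual sending rates and keep only a small, bounded backlog queued in the network; a rough sketch of that update rule in the classic TCP Vegas formulation (not Tor's actual Tor-Vegas code; the alpha/beta values are illustrative):

def vegas_update(cwnd: float, base_rtt: float, current_rtt: float,
                 alpha: float = 2.0, beta: float = 4.0) -> float:
    """One Vegas-style congestion window update per RTT (illustrative sketch)."""
    expected = cwnd / base_rtt             # rate if there were no queueing
    actual = cwnd / current_rtt            # rate actually observed
    diff = (expected - actual) * base_rtt  # estimated packets sitting in queues
    if diff < alpha:
        return cwnd + 1                    # queues are short: probe for more bandwidth
    if diff > beta:
        return cwnd - 1                    # queues are building up: back off
    return cwnd                            # within the target band: hold steady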

Also, now that 0.4.7.7 has been out for a week, you can see Advertised Bandwidth (roughly the bandwidth observed across the Tor network) starting to grow.

Another important point is the plan for UDP support, which looks a lot more feasible after this improvement:

The astute reader will note that we rejected datagram transports. However, this does not mean that Tor will never carry UDP traffic. On the contrary, congestion control deployment means that queue delay and latency will be much more stable and predictable. This will enable us to carry UDP without packet drops in the network, and only drop UDP at the edges, when the congestion window becomes full. We are hopeful that this new behavior will match what existing UDP protocols expect, allowing their use over Tor.

Hooking an ML extension directly into PostgreSQL

Saw 「Show HN: PostgresML, now with analytics and project management (postgresml.org)」 on the Hacker News front page, a project that lets you run ML algorithms directly through a PostgreSQL extension: 「PostgresML - an end-to-end machine learning solution」. From GitHub, most of the code is Python.

The GitHub page makes it clear the project is still at a fairly early stage:

This project is currently a proof of concept. Some important features, which we are currently thinking about or working on, are listed below.

If you want to use it right now, it's mainly handy for exploring data? One idea is to hang a replica off the production database and run the queries there, so production performance isn't affected; that should be fine...

Looking at the supported algorithms, they're mostly the classic ML algorithms, simply wrapping the Python packages XGBoost and scikit-learn.

Those algorithms are plenty useful, and having them inside PostgreSQL makes things a lot more convenient (no more dumping data back and forth, though you do have to handle dirty data carefully). The project also ships a UI for looking at the data, but I'd guess dedicated visualization tools are still richer.
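As a point of comparison, this is roughly the kind of scikit-learn workflow the extension wraps, except that outside the database you'd first have to pull the rows out of PostgreSQL yourself (a generic sketch on a toy dataset, nothing PostgresML-specific):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# In the PostgresML setup these features would come straight from a table
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))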

Another thought is that it would be decent for teaching: an instructor could use it to demo algorithms in class without having to write a pile of code...