ARM 版的 Windows 宣稱要改善 x86 轉譯速度

在「Microsoft gives Windows new compiler, kernel, scheduler, and x86 translation layer on ARM」這邊看到的:

Microsoft also unveiled the name for its new x86 translation layer for Windows on ARM: Prism. Microsoft told Ars Technica that Prism is as fast as Apple’s Rosetta 2, which is interesting because Apple’s M series chips contain special silicon to speed up the translation process, making me wonder if Qualcomm has done the same, or is just brute-forcing it.

看起來之前 Windows 平板上跑 x86 應用程式很慢的痛點有機會改善?另外不知道技術相似度如何,有沒有機會看到細節分析...

用 x86 組語寫的 web forum

另外一個在週末看到的奇怪東西,用 x86 組語寫的 web forum:「AsmBB – a lightweight web forum engine written in assembly language (asmbb.org)」,官網在 asmBB 這邊,說明則是在「What is AsmBB?」這頁。

是個 x86 assembly + SQLite 的組合:

AsmBB is fully written in assembly language and uses SQLite as a database back-end. That is why it can work on really weak hosting and in the same time serve huge amount of visitors without lags and delays.

程式碼是用 Fossil 管的,不過作者有 mirror 到 GitHub 上:「johnfound/asmbb」。

SQlite 的部分用了 SQLeet,可以在 sqleet.inc 這個檔案裡面看到引用了 libsqleet.so

AsmBB uses SQLite with SQLeet extension as a back-end storage database and FastCGI interface to the web server.

組語的部分看起來是用 fasm 寫的,看程式碼 (像是 encryption.asm) 可以看到一些好用的語法,像是自動處理 function 參數的部分 (應該有設定某種 convention),跟當年寫 MASM 的用法方便不少:

stdcall TextCreate, sizeof.TText
mov     edx, eax

stdcall TextCat, edx, <"Content-type: text/html; charset=utf-8", 13, 10, 13, 10>
stdcall RenderTemplate, edx, 'crypto.tpl', 0, esi

世紀帝國系列裡的 AoE 與 AoK 使用組合語言加速的情況

Age of Empires (1997) 與 Age of Empires II: The Age of Kings (1999) 是當年兩個很紅的遊戲,有傳言裡面有很大一部分是用組合語言寫的。

剛剛在 Hacker News 上看到「Original Age of Empires 2 dev talks about its usage of assembly code (reddit.com)」這篇,看起來是原作者跳出來說明這段傳言 (講的更細),原文在 Reddit 的「AoE is written in Assembly - is this actually TRUE?! :O : aoe2」這邊。

32-bit 的 x86 組語部分有大約 13k 行,大多數 (11.5k 行) 是與繪圖有關的 (當年繪圖相關的功能都還是 CPU 在算):

There were about ~13,000 lines of x86 32-bit assembly code written in total.

The vast majority, about ~11,500 lines worth, was in the 'drawing core', which drew SLP sprites in a variety of ways (mirrored, stippled, clipped). The code for this was in a separate .asm file, which was "compiled" by Microsoft Macro Assembler 6.1 into a .obj file.

最主要的目的當然是效能,這邊提到隔壁棚的星海爭霸預設還是 640x480 的時候,AoE 這邊預設可以上 800x600:

The use of assembly in the drawing core resulting in a ~10x sprite drawing speed improvement over the C++ reference implementations, and AoE's drawing core was notably faster than competitors like StarCraft, which is why the default resolution 'out of the box' for AoE was 800 by 600, when nearly all our competition was 640 x 480 resolution - we could scroll the screen and fill it with sprites as fast or faster even though we had twice as many pixels-ish in the game world area.

後來到 64-bit 年代就變成用 C++,主要是因為 compiler optimization 的改進,以及多 CPU 與 multi-threading 的技術,讓計算力提升很多:

I re-wrote the assembly functions into C++ for both Definitive Editions, as they are 64-bit programs, and inline assembly was never supported by the 64-bit C++ compiler, and the vastly improved register sets and compiler optimizations made it un-necessary. Additionally, sprite drawing in the definitive editions is multi-threaded, and will use up to 4 cores for that task alone.

算是時代的演進,意外的留下一些記錄... 而更後面的時代則是把這些計算 offload 到顯卡上面了。

AWS 新推出的 m7a 宣稱比 m6a 多 50% 效能?

AWS 在「Introducing Amazon EC2 M7a instances (Preview)」這邊看到 m7a 會比 m6a 快 50% 的宣稱:

These instances deliver up to 50% greater performance on average compared to M6a instances.

目前還是 preview 階段,需要申請才有機會用,所以還不知道他的真實性能是怎麼樣?另外一方面,價錢也還沒查到... 但如果價錢不要漲太多的話,算一下好像有可能跟上 ARMm7g 了?

另外這樣也就蠻值得期待會不會有 t4a

立端科技的 IIoT-I530

因為工作的關係,所以會關注一些特殊的硬體,但好像暫時找不到地方放,就丟在 blog 上面記錄好了...

這次看到的是支援一堆 PoE+ 的機器:「Tiger Lake-U system features dual 2.5GbE and six PoE+ ports」。

除了 PoE+ 以外另外有 mSATASATA 支援,然後還有一堆 M.2 的界面可以接 (好像是走 PCIe):

Lanner’s “IIoT-I530” embedded PC runs Linux on an 11th Gen U-series CPU and supplies with up to 64GB RAM, 2x 2.5GbE, 6x PoE+, 2x COM, 4x USB 3.0, 2x HDMI, 3x M.2, SATA, mSATA, and DIO.

把 blog 搬到 t4g.small 上

算了一下成本還可以接受 (機器 + 空間 + 流量),就把 blog 搬到 AWSt4g.small (ARM) 上,理論上頁面的速度應該會快不少,過幾天等穩定性沒問題後就來買 RI...

x86-64 轉到 ARM 上面,主要是 Percona Server 目前沒有提供 ARM binary 的 apt repository,所以就改用 MariaDB 了。

其他的倒是都差不多,目前的 Ubuntu + nginx + PHP 沒什麼問題,跑一陣子看看...

微軟開源 1983 年版的 GW-BASIC

微軟用 MIT License 放出 1983 年版的 GW-BASIC:「Microsoft Open-Sources GW-BASIC」。

這次放出來程式看起來是 x86 assembly,不過放出來的版本好像也不能算是「原始」的版本,而是從 "master implementation" 轉譯出來的版本:

This source was ‘translated’?

Each of the assembly source files contains a header stating This translation created 10-Feb-83 by Version 4.3

Since the Instruction Set Architecture (ISA) of the early processors used in home and personal computers weren’t spectacularly different from one another, Microsoft was able to generate a substantial amount of the code for a port from the sources of a master implementation. (Alas, sorry, we’re unable to open-source the ISA translator.)

主要還是 PR,然後帶一些考古價值...

在 x86-64 上跑 Raspberry Pi 的 OS

看到「dockerpi」這個專案,讓你可以在 x86-64 上模擬 Raspberry Pi 環境跑 Raspbian

然後整包是先透過 Docker 產出一個獨立環境,然後裡面跑 QEMU 模擬 ARM 的環境,接下來再跑 Raspbian:

A full ARM environment is created by using Docker to bootstrap a QEMU virtual machine. The Docker QEMU process virtualises a machine with a single core ARM11 CPU and 256MB RAM, just like the Raspberry Pi. The official Raspbian image is mounted and booted along with a modified QEMU compatible kernel.

這馬上讓人想到 Inception 啊 XDDD

Anyway,這個方法對於想玩玩的人可以省不少功夫,是個有趣的專案就是了...

反過來在 ARM 上面跑 x86 模擬器...

看到「Box86 - Linux Userspace x86 Emulator with a twist, targeted at ARM Linux devices」這個專案,在 ARM 的機器上跑 x86 模擬器:

然後其實作者還弄了不少東西,像是透過 GL4ES 處理 OpenGL 的界面,其實前面做的功夫比想像中高不少...

然後從作者給的範例可以看出來主力在遊戲 XDDD

C++ 與組語的速度...

Hacker News Daily 上看到「Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?」覺得很有趣...

作者寫了一段 assembly,但跑起來比用 C++ 同義的版本慢多了。目前最高分的答案給了很清楚的解釋...

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

上面這段 code 是作者寫的組語版本,用到 div 指令,這是非常慢的指令:

On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles.

相較於 C++ 的版本,用到的是 shr (logical shift right,以位元方式往右平移,最高位補零),速度快太多:

shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.

這是用到無號整數透過 shr 平移一格剛好是除以二的特性,因為速度的關係,這個用法到現在還是很常被拿來用,但對於平常沒在寫 assembly 的人就會有上面的誤解 XDDD