ZLUDA: running CUDA programs directly on AMD (formerly Intel) GPUs

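I previously wrote about "ZLUDA: running CUDA programs directly on Intel integrated graphics", and then things took a sharp turn: AMD stepped in to sponsor the project, and it now supports AMD GPUs instead. See "AMD Quietly Funded A Drop-In CUDA Implementation Built On ROCm: It's Now Open-Source". The project lives on GitHub at vosen/ZLUDA, and AMD GPU support landed in one huge commit, 1b9ba2b2333746c5e2b05a2bf24fa6ec3828dcdf, with this commit log: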

Nobody expects the Red Team

Too many changes to list, but broadly:
* Remove Intel GPU support from the compiler
* Add AMD GPU support to the compiler
* Remove Intel GPU host code
* Add AMD GPU host code
* More device instructions. From 40 to 68
* More host functions. From 48 to 184
* Add proof of concept implementation of OptiX framework
* Add minimal support of cuDNN, cuBLAS, cuSPARSE, cuFFT, NCCL, NVML
* Improve ZLUDA launcher for Windows

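The twists here, and what came after, are honestly hard to sum up... The author was working at Intel when he started, and after a while Intel decided the project had no future. AMD then got in touch and funded it, and eventually came to the same conclusion. Under the contract with AMD, if AMD saw no future in the project, the author was free to release it as open source: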

Why is this project suddenly back after 3 years? What happened to Intel GPU support?

In 2021 I was contacted by Intel about the development od ZLUDA. I was an Intel employee at the time. While we were building a case for ZLUDA internally, I was asked for a far-reaching discretion: not to advertise the fact that Intel was evaluating ZLUDA and definitely not to make any commits to the public ZLUDA repo. After some deliberation, Intel decided that there is no business case for running CUDA applications on Intel GPUs.

Shortly thereafter I got in contact with AMD and in early 2022 I have left Intel and signed a ZLUDA development contract with AMD. Once again I was asked for a far-reaching discretion: not to advertise the fact that AMD is evaluating ZLUDA and definitely not to make any commits to the public ZLUDA repo. After two years of development and some deliberation, AMD decided that there is no business case for running CUDA applications on AMD GPUs.

One of the terms of my contract with AMD was that if AMD did not find it fit for further development, I could release it. Which brings us to today.

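This is actually pretty easy to understand. CUDA is Nvidia's ecosystem after all. Unless you manage to overtake them and then define a pile of your own proprietary features (the way Microsoft played it with IE back in the day), you are just propping up someone else's platform.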

Phoronix got access to the software a few days before it was open-sourced, and after several days of testing gave it a "pretty good" verdict:

Andrzej Janik reached out and provided access to the new ZLUDA implementation for AMD ROCm to allow me to test it out and benchmark it in advance of today's planned public announcement. I've been testing it out for a few days and it's been a positive experience: CUDA-enabled software indeed running atop ROCm and without any changes. Even proprietary renderers and the like working with this "CUDA on Radeon" implementation.

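Also, because some benchmark software reports results back to a server, which could have leaked information during testing, ZLUDA deliberately exposed the device name as "Graphics Device". With this open-source release it switches back to the real name: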

In my screenshots and for the past two years of development the exposed device name for Radeon GPUs via CUDA has just been "Graphics Device" rather than the actual AMD Radeon graphics adapter with ROCm. The reason for this has been due to CUDA benchmarks auto-reporting results and other software that may have automated telemetry, to avoid leaking the fact of Radeon GPU use under CUDA, it's been set to the generic "Graphics Device" string. I'm told as part of today's open-sourcing of this ZLUDA on Radeon code that the change will be in place to expose the actual Radeon graphics card string rather than the generic "Graphics Device" concealer.
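To make "the exposed device name via CUDA" concrete, here is a minimal sketch of how a program might read that string through the CUDA driver API using Python's ctypes. The helper name `query_cuda_device_name` and the list of library file names are my own assumptions for illustration; `cuInit`, `cuDeviceGet` and `cuDeviceGetName` are the standard driver API entry points. Whatever library the dynamic loader resolves answers the call, so how ZLUDA gets substituted for the Nvidia driver is outside this sketch.

```python
import ctypes


def query_cuda_device_name(index=0):
    """Return the device name reported by the CUDA driver library, or None.

    Hypothetical helper for illustration only. It returns None when no
    CUDA-compatible driver library can be loaded or any call fails.
    """
    lib = None
    # Assumed library file names: Linux first, then Windows.
    for candidate in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            lib = ctypes.CDLL(candidate)
            break
        except OSError:
            continue
    if lib is None:
        return None

    # Every driver API call returns a CUresult; 0 means CUDA_SUCCESS.
    if lib.cuInit(0) != 0:
        return None

    device = ctypes.c_int()  # CUdevice is a plain int handle
    if lib.cuDeviceGet(ctypes.byref(device), index) != 0:
        return None

    buf = ctypes.create_string_buffer(256)
    if lib.cuDeviceGetName(buf, len(buf), device) != 0:
        return None
    return buf.value.decode(errors="replace")


if __name__ == "__main__":
    print(query_cuda_device_name())
```

A benchmark with automated telemetry would upload a string obtained roughly this way, which is why a generic placeholder was enough to hide the Radeon hardware during development.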

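The author's own benchmarks vary quite a bit from test to test, but by the author's own calculation the overall performance is about the same as the OpenCL version.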

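Phoronix ran a comparison against Nvidia... using Blender, which supports both Nvidia and AMD cards. The result is jaw-dropping: going through ZLUDA's translation is faster than the natively supported path, so there is clearly more to discuss about the optimization here. (This was the BMW27 scene; the Classroom scene showed the same thing.)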

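Even so, CUDA on AMD GPUs probably still won't take off. AMD will try to get each framework to support its hardware natively, and most developers build on top of a framework; very few write everything from scratch...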

ZLUDA: running CUDA programs directly on Intel integrated graphics

Something interesting from the Hacker News front page: "Zluda: Run CUDA code on Intel GPUs, unmodified (github.com/vosen)". The project is at "CUDA on Intel GPUs", and it was last updated in 2021.

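The idea behind the project is easy to guess: tap into the CUDA ecosystem by running existing CUDA applications directly on Intel GPUs, so that software with only a CUDA implementation and no OpenCL one becomes usable.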

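At first I assumed it targeted Intel's new discrete Arc cards, but it turns out the project stopped updating in 2021 and was tested on integrated graphics: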

ZLUDA performance has been measured with GeekBench 5.2.3 on Intel UHD 630.

Judging from the benchmark results, most of the functionality seems to have been ported, so the tests at least run rather than crash.

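However, the Hacker News discussion suggests there are still problems, and most AI applications will end up supporting OpenCL anyway, so it doesn't seem all that useful...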