一個檔案直接跑起大型語言模型的 llamafile

llamafile 是昨天很紅的一個專案,由 Mozilla Internet Ecosystem (MIECO) 弄出來的專案,可以使用一個檔案直接跑起大型語言模型的 HTTP server,讓你可以在瀏覽器裡面直接使用。

直接看官方的 README.md 就可以蠻簡單的跑起來,不過 Simon Willison 也有寫一篇文章介紹一下,可以看看:「llamafile is the new best way to run a LLM on your own computer」。

這邊說的「一個檔案」是指同一個檔案同時可以在 WindowsmacOSLinuxFreeBSDOpenBSD 以及 NetBSD 上面跑,而且這個檔案也把大型語言模型 (LLM) 的 model 檔案包進去,所以檔案會蠻大的,但畢竟就是方便讓人使用:

下載下來直接執行,預設就會在 port 8080 跑起來,可以直接連到 http://127.0.0.1:8080/ 連進去使用。

llamafile 用到的技術是 Cosmopolitan 專案,可以把多個平台的執行檔包在同一個檔案裡面使用。

另外用到的專案是 llama.cpp,這個蠻多人都用過了,可以很方便的用 CPU 或是 GPU 跑 LLM。

我在 Linux 上面跑剛好遇到幾個問題,都是在 README.md 上面有提到的。

第一個是 zsh 無法直接跑 llamafile (Ubuntu 22.04 內建 zsh 的是 5.8.1),這邊官方的建議是用 sh -c ./llamafile 避開:

If you use zsh and have trouble running llamafile, try saying sh -c ./llamafile. This is due to a bug that was fixed in zsh 5.9+. The same is the case for Python subprocess, old versions of Fish, etc.

另外一個對是 GPU 的支援,這邊跟你說加上 --n-gpu-layers 35 就可以用,所以一開始先用 sh -c ./llamafile --n-gpu-layers 35 試著跑:

On Linux, Nvidia cuBLAS GPU support will be compiled on the fly if (1) you have the cc compiler installed, (2) you pass the --n-gpu-layers 35 flag (or whatever value is appropriate) to enable GPU, and (3) the CUDA developer toolkit is installed on your machine and the nvcc compiler is on your path.

但可以看到沒有被 offload 到 GPU 上面:

llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 4165.47 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB

嘗試了不同的方法,發現要跑 sh -c "./llamafile --n-gpu-layers 35",也就是把參數一起包進去,這樣就會出現對應的 offload 資訊,而且輸出也快很多:

llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   70.42 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MB

玩了一下像是這樣:

Evernote 限縮免費版的方案

Evernote 這次算是完全限縮了免費版的方案:「Update: Evernote Free accounts will have fifty notes and one notebook」。

從 2023/12/04 開始限縮成一個 notebook 以及 50 個 note:

From December 4, the Evernote Free experience is changing. Going forward, new and existing Free users will have a maximum of fifty notes and one notebook per account.

從今年一月被併購後,二月就把整個 Evernote 團隊砍掉,就有蠻多人有悲觀的看法了,所以想轉的應該都已經轉的差不多了...

常看到的替代方案是 Notion 以及 Obsidian,反而沒看到什麼 open source 的討論?(剛剛找了一下,是有不少選擇,但名字都沒看過...)

Amazon Q (來猜名字的由來...)

AWS 推出了 Amazon Q,目前還在 preview 階段:「Amazon Q brings generative AI-powered assistance to IT pros and developers (preview)」。

產品本身主要就是 LLM 的應用,以現在來說沒有太特別,主要是 Hacker News 上大家在猜這個 Q 到底是取自哪裡:「Amazon Q (Preview) (amazon.com)」。

看到有猜 Q (Star Trek)Q-learning 以及 Q (James Bond)

id=38448900 這邊有人提到 Q 是 question:

from the NYTimes article: The name Q is a play on the word “question,” given the chatbot’s conversational nature, Mr. Selipsky said. It is also a play on the character Q in the James Bond novels, who makes stealthy, helpful tools, and on a powerful “Star Trek” figure, he added.

找了一下應該是「Amazon Introduces Q, an A.I. Chatbot for Companies」這篇文章,因為 paywall 的關係可能看不到全文,可以看 archive.today 這邊的 archive:「Amazon Introduces Q, an A.I. Chatbot for Companies」。

反正牽扯的到的都提一下...

AWS 推出 RDS for Db2 以及 Amazon Aurora Limitless Database

資料庫的部分,看到 AWS 在這次 re:Invent 放出「Getting started with new Amazon RDS for Db2」與「Join the preview of Amazon Aurora Limitless Database」這兩則消息。

首先是看到 Db2 這個詞覺得怪,查了以後發現原來是 2017 年改過名字正式的拼法:

The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

這個消息比較像是補產品線,最大的資訊其實是查資料知道 2017 年改成 Db2...

另外的新產品 Amazon Aurora Limitless Database 就比較特別了,我翻了一下先前的 Amazon Aurora Serverless,是屬於利用 read replica 拓展「讀」的承載能力:

Amazon Aurora Serverless is an on-demand, autoscaling configuration for Amazon Aurora. It automatically starts up, shuts down, and scales capacity up or down based on your application's needs.

這次在 Amazon Aurora Limitless Database 裡面直接提到 sharding 技巧了,這就有機會拓展「寫」的承載能力:

Shards are Aurora PostgreSQL DB instances that each store a subset of the data for your database, allowing for parallel processing to achieve higher write throughput. Transaction routers manage the distributed nature of the database and present a single database image to database clients.

這次 preview 推出的是 PostgreSQL 15 的版本:

The preview runs in a new Aurora PostgreSQL cluster with version 15 in the AWS US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) Regions.

這代表 database schema 不要設計的太差,是蠻有機會讓 RDS 幫你處理 sharding 底層所有需要的細節?等後續價錢出來可以看看... (應該是 open/public preview 或是 open beta 的階段)

AWS 推出 Amazon S3 Express One Zone

AWS 推出了以效能為導向的 Amazon S3 Express One Zone:「Announcing the new Amazon S3 Express One Zone high performance storage class」。

從名字裡的 One Zone 可以看到這是只有在一個 AZ,主打超低 latency:

The new Amazon S3 Express One Zone storage class is designed to deliver up to 10x better performance than the S3 Standard storage class while handling hundreds of thousands of requests per second with consistent single-digit millisecond latency, making it a great fit for your most frequently accessed data and your most demanding applications.

但費用相當貴,以 us-east-1 來看的話是 $0.16/GB/mo,如果拿其他一些 storage 方案來比,可以看到非常大的差距:

  • S3 Standard:$0.023/GB/mo
  • General Purpose SSD (gp3):$0.08/GB/mo
  • General Purpose SSD (gp2):$0.1/GB/mo

可以猜測後面應該全是 NVM 之類的 storage (不過文章裡沒有提到)。

這次的 Amazon S3 Express One Zone 也多出了很多特別的限制。

首先是新的 bucket type,在這個 bucket type 下面 ListObjectsV2 呼叫就必須以 / 結尾 (這暗示後面的資料處理有對這點 optimization),另外傳回的資料不保證順序了:

The path delimiter must be “/“, and any prefixes that you supply to ListObjectsV2 must end with a delimiter. Also, list operations return results without first sorting them, so you cannot do a “start after” retrieval.

另外看起來是在 AZ 裡面直接認證,所以有新的 authentication model:

The new CreateSession function returns a session token that grants access to a specific bucket for five minutes.

然後 bucket naming 因為有後處理,在命名上不需要在整個 AWS 是唯一的 (因為被加料了):

Directory bucket names must be unique within their AWS Region, and must specify an Availability Zone ID in a specially formed suffix. If my base bucket name is jbarr and it exists in Availability Zone use1-az5 (Availability Zone 5 in the US East (N. Virginia) Region) the name that I supply to CreateBucket would be jbarr--use1-az5--x-s3.

另外資料還是可以在同一個 region 下跨 AZ 存取,而且同一個 region 下面的 compute resources (像是 EC2) 不收傳輸費用:

Although the bucket exists within a specific Availability Zone, it is accessible from the other zones in the region, and there are no data transfer charges for requests from compute resources in one Availability Zone to directory buckets in another one in the same region.

費用的部分還有個比較特別的但書,超過 512KB 的 request 會需要額外收費:

You pay an additional per-GB fee for the portion of any request that exceeds 512 KB. For more information, see the Amazon S3 Pricing page.

主要是給自己開發的應用程式用的,現有的 framework 大多都有利用 batch & buffering 的技巧降低 latency 所帶來的效能影響。

平常應該是用不太到,但就有個印象,真的在架構設計上跑不掉的時候有個選擇...

Nextcloud 吃下 Roundcube

Nextcloud 官方的公告在「Open source email pioneer Roundcube joins the Nextcloud family」這邊,然後有新聞整理:「Roundcube Open-Source Webmail Software Merges With Nextcloud」,以及 Hacker News 上對應的討論:「Roundcube open-source webmail software merges with Nextcloud (phoronix.com)」。

居然看到 Roundcube 的新聞,這是個用 PHP 寫的,頗老牌的 Webmail 系統了,翻了 Wikipedia 上的資料,第一個 stable 居然是 2008 年?我以為應該更早,因為印象中當年交大的 D2 E-mail 系統在後來有用到...?

後來的情況有點微妙,2015/2016 年的時候 Roundcube 搞了 crowdfunding 結果變成一場災難:

On 3 May 2015, Roundcube announced, in partnership with Kolab Systems AG, that they planned to completely rewrite Roundcube and create Roundcube Next. A crowdfunding campaign was set up to finance the project. The goal of $80,000 was reached on June 24. The final amount raised was US$103,541.

Roundcube Next was intended to include additional features like calendar, chat and file management. This was to be implemented using WebRTC and connectors from popular services like Dropbox and OwnCloud.

However, Kolab Systems and Roundcube stopped development on the project in 2016, with no information or refunds provided to project backers, leading to a failed crowdfund. A Roundcube developer later claimed Roundcube had no ownership over the Roundcube Next campaign,[10] despite its public engagement and ownership on the crowdfund page.

這次的情況從 Hacker News 上的討論也看得出來,大家對 Nextcloud 沒什麼好感,而且 Nextcloud 本身有個 Nextcloud Mail,沒看懂到底是怎麼一回事...

Broadcom 在併購 VMware 後裁員

Hacker News 上看到「Broadcom lays off many VMware employees after closing acquisition (businessinsider.com)」這篇,前幾天 Broadcom 買下 VMware 的消息才公告出來 (參考「Broadcom 買下 VMware」),馬上就有裁員的消息了... 原始報導在「Broadcom lays off many VMware employees after closing its $69 billion acquisition of the company」,不過因為 paywall 的關係,可以看參考 archive.today 上面的頁面:「Broadcom lays off many VMware employees after closing its $69 billion acquisition of the company」。

報導裡面有提到,在併購前 VMware 自己就有在砍人了,不過併購完馬上就再砍,不太好看...

還是如同之前提到的,看不太懂這個併購案的目的,看起來只是單純往軟體產業投資?因為 VMware 跟先前併購的 CA TechnologiesSymantec 好像是沒有太直接可以拼湊起來的感覺...

Georgi Gerganov 給了在 AWS 上面用 GPU instance 跑 llama.cpp 的說明

Georgi Gerganov 寫了一篇怎麼在 AWS 上面用 GPU instance 跑 llama.cpp 的說明:「Using llama.cpp with AWS instances #4225」。

先跳到最後面的懶人套件,直接提供了 shell script 幫你弄完:

bash -c "$(curl -s https://ggml.ai/server-llm.sh)"

回到開頭的部分,機器的選擇上面,他選了一台最便宜的 4 vCPU + 16GB RAM + 16GB VRAM 的機器來跑。

然後他提到了 OpenHermes-2.5-Mistral-7B 這個模型最近很紅,也許有機會看一下:

We have just 16GB VRAM to work with, so we likely want to choose a 7B model. Lately, the OpenHermes-2.5-Mistral-7B model is getting some traction so let's go with it.

用 llama.cpp 裡面的 server 跑起 API server:

./server -m models/openhermes-7b-v2.5/ggml-model-q4_k.gguf --port 8888 --host 0.0.0.0 --ctx-size 10240 --parallel 4 -ngl 99 -n 512

接著就可以用 cURL 測試:

curl -s http://XXX.XXX.XXX.XXX:8888/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
            {
                "role": "system",
                "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
            },
            {
                "role": "user",
                "content": "Write a limerick about python exceptions"
            }
        ]
    }' | jq

都包好了...

zlib、gzip 與 zip 的關係 (Mark Adler 的回覆)

Hacker News 上看到「How are zlib, gzip and zip related? (stackoverflow.com)」這個討論,原文是 StackOverflow 上 2013 年時的問題與回答:「How are zlib, gzip and zip related? What do they have in common and how are they different?」。

short form 算是解釋了這三個的關係,另外多帶出了其他相關的格式以及演算法:

.zip is an archive format using, usually, the Deflate compression method. The .gz gzip format is for single files, also using the Deflate compression method. Often gzip is used in combination with tar to make a compressed archive format, .tar.gz. The zlib library provides Deflate compression and decompression code for use by zip, gzip, png (which uses the zlib wrapper on deflate data), and many other applications.

而 long form 則是描述了整段歷史,完整到下面的 comment 有人提到希望能夠提供一些引用的連結,當作 reference:

This post is packed with so much history and information that I feel like some citations need be added incase people try to reference this post as an information source. Though if this information is reflected somewhere with citations like Wikipedia, a link to such similar cited work would be appreciated.

回答的作者 Mark Adler 也在 comment 上很霸氣的回:

I am the reference, having been part of all of that. This post could be cited in Wikipedia as an original source.

翻了一下他的事蹟:

老人家交代了當年的一些脈絡 (像是專利問題),會更容易理解為什麼現在發展成這樣...

AWS 推出 32TB RAM 的機種 u7in-32tb.224xlarge

最近是 AWS re:Invent 2023,會有蠻多消息陸陸續續冒出來。

剛剛看到「Introducing Amazon EC2 high memory U7i Instances for large in-memory databases (preview)」這個,AWS 推出了 u7in-32tb.224xlarge,896 vCPU + 32TB RAM 的機器。

先前最大的應該是 u-24tb1.112xlarge,448 vCPU + 24TB RAM 的機器 (包括 vCPU 數量與記憶體大小),所以這次 vCPU 數量多一倍,記憶體多 50%。

很理所當然的又拿 SAP 來介紹了:

The new U7i instances are designed to support large, in-memory databases including SAP HANA, Oracle, and SQL Server.

比較特別的是 us-east-1 沒有在第一波的名單,反而是 us-west-2 以及其他的區域,不確定當初怎麼決定的:

Powered by custom fourth generation Intel Xeon Scalable Processors (Sapphire Rapids), the instances are now available in multiple AWS regions in preview form, in the US West (Oregon), Asia Pacific (Seoul), and Europe (Frankfurt) AWS Regions, as follows[.]

因為是 preview 階段,可能是這些區域有大客戶有需求,所以先佈到這幾區?