Saw 「Spread Your Wings: Falcon 180B is here」: Falcon 180B has been released, claiming to be on the same footing as LLaMA 2, but the license as it currently stands is not an open source license, so this is mostly just a note for the record; in practice I probably won't touch it...
There is quite a bit of discussion about the license on Hacker News; see 「Falcon 180B (huggingface.co)」.
Saw 「Llama.cpp: Full CUDA GPU Acceleration (github.com/ggerganov)」 on the Hacker News front page; the corresponding original page is 「CUDA full GPU acceleration, KV cache in VRAM #1827」.
It explains that llama.cpp's earlier GPU acceleration still left quite a bit of work on the CPU; this change implements GPU versions of all the operations ggml currently supports:
This PR adds GPU acceleration for all remaining ggml tensors that didn't yet have it. Especially for long generations this makes a large difference because the KV cache is still CPU only on master and gets larger as the context fills up.
Quite a few people posted different benchmark results. Note that this isn't about moving the heavy CPU work onto the GPU; it moves over the parts that had stayed on the CPU because they were relatively light, so it's not an order-of-magnitude speedup, but the improvement already looks pretty decent:
Early attempt this morning we're getting ~2.5-2.8x perf increase on 4090s and about 1.8-2x on 3090Ti.
As for Falcon... there doesn't seem to be much good progress on it yet XD
Saw on the Hacker News front page that Georgi Gerganov is starting a company: 「GGML – AI at the Edge (ggml.ai)」; the official site is 「GGML - AI at the edge」.
As Georgi Gerganov mentions, projects like llama.cpp started out as his side projects and then unexpectedly took off:
I've started a company: https://t.co/jFknDoasSy
From a fun side project just a few months ago, ggml has now become a useful library and framework for machine learning with a great open-source community
— Georgi Gerganov (@ggerganov) June 6, 2023
He also mentions that Nat Friedman and Daniel Gross gave him a hand:
I'm incredibly grateful to @natfriedman and @danielgross for the support & funding and also for helping me get inspired even more in this project
There is still a long way ahead with many ideas to try and cool things to do. Hope you will join and help us create something useful!
— Georgi Gerganov (@ggerganov) June 6, 2023
The official site mentions that this was pre-seed funding:
ggml.ai is a company founded by Georgi Gerganov to support the development of ggml. Nat Friedman and Daniel Gross provided the pre-seed funding.
Looking back now, llama.cpp took off mainly because it could run LLaMA 7B on a CPU, and running it on a CPU actually wasn't that slow.
It then attracted a lot of people to help out, which led to plenty of optimizations (like the mmap change covered in 「llama.cpp 的載入速度加速」, which cuts down the load time and lets multiple processes reuse the cache), and later on GPU support...
But I'm not sure what his long-term plan is now that he's started a company...?
The "open" people talk about with LLMs isn't open source in the license sense; it's closer to just "free to use", usually with restrictions attached.
But even with the bar lowered to "free to use", LLaMA 65B had been leading the pack for more than three months since it was released (or rather, "got released") in February, until last week, when news came that it had been overtaken by Falcon 40B:
LLaMa is dethroned 👑 A brand new LLM is topping the Open Leaderboard: Falcon 40B 🛩
*interesting* specs:
- tuned for efficient inference
- licence similar to Unity allowing commercial use
- strong performances
- high-quality dataset also released
Check the authors' thread 👇 https://t.co/vojobBXFQT pic.twitter.com/BuOLnHebhU
— Thomas Wolf (@Thom_Wolf) May 26, 2023
On the 「Open LLM Leaderboard」 benchmarks you can see it leads everywhere except TruthfulQA (0-shot), and the overall average is also in the lead:
Scrolling down, the 7B version also performs quite well, and presumably it can be tuned further later on.
More importantly, I just saw the news that the model's license has been changed to the Apache License 2.0, so a replacement for LLaMA is finally taking shape:
The license of the Falcon 40B model has just been changed to… Apache-2 which means that this model is now free for any usage including commercial use (and same for the 7B) 🎉 https://t.co/LZcmejPdf5
— Thomas Wolf (@Thom_Wolf) May 31, 2023
I also noticed that this model was built on AWS SageMaker; looking up the Technology Innovation Institute, it's clearly an organization with money to spend:
Falcon-40B was trained on AWS SageMaker, on 384 A100 40GB GPUs in P4d instances.
The Technology Innovation Institute (TII) is an Abu Dhabi government funded research institution that operates in the areas of artificial intelligence, quantum computing, autonomous robotics, cryptography, advanced materials, digital science, directed energy and secure systems. The institute is a part of the Abu Dhabi Government’s Advanced Technology Research Council (ATRC).
On Hacker News someone has already got it running, and with the instruction-tuned (InstructGPT-style) version at that: 「Falcon 40B LLM (which beats Llama) now Apache 2.0 (twitter.com/thom_wolf)」. Reportedly the 4-bit quantized version can run on a 40GB A100 or on two 24GB 3090/4090 cards.
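If you want to try that yourself, a minimal sketch along the following lines should be close; the model id tiiuae/falcon-40b-instruct and the transformers/bitsandbytes flags here are my assumptions rather than something taken from the linked thread:

# Minimal sketch (assumptions on my part, not from the thread): load the
# instruction-tuned Falcon 40B in 4-bit and spread it across the available GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,       # 4-bit quantization via bitsandbytes
    device_map="auto",       # let accelerate place layers on the available GPUs
    trust_remote_code=True,  # Falcon shipped custom modeling code at the time
)

inputs = tokenizer("Explain FAANG.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))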
Also, the ggml folks will probably get moving on this within a few days; we can let this one play out a bit longer...
I recently reinstalled my desktop machine and have been rebuilding a bunch of environments... one of them is llama.cpp, and when I went to the project page I was surprised to find these two new features:
OpenBLAS support
cuBLAS and CLBlast support
This means GPU acceleration is now possible, so I followed the instructions and built a version to test.
After building it I ran the 7B model, which looked noticeably faster, then switched to the 13B model and was able to push all 40 layers onto a 3060 (12GB version) GPU:
./main -m models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 40
From the log you can see all 40 layers went to the GPU, using about 7.5GB:
llama.cpp: loading model from models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 90.75 KB
llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 7562 MB
llama_init_from_file: kv self size = 400.00 MB
I also tried the 30B model, but could only offload 28 layers (out of 60 in total); any more and the GPU memory couldn't take it.
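A quick back-of-envelope check (my own estimate, not something read out of the logs) makes that limit plausible:

# Rough per-layer VRAM estimate, using numbers reported above (mine, not measured for 30B):
mb_per_layer_13b = 7562 / 40         # 13B q4_0: ~189 MB per offloaded layer
mb_per_layer_30b = 20 * 1024 / 60    # 30B q4_0: assuming a ~20GB weights file over 60 layers -> ~341 MB
print(f"13B: ~{mb_per_layer_13b:.0f} MB per layer")
print(f"30B: 28 layers ~= {28 * mb_per_layer_30b / 1024:.1f} GB")  # ~9.3 GB, most of a 12GB card

So 28 layers already eat most of the 3060's 12GB once the KV cache and other buffers are added.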
But being able to use the GPU is a big step forward. This version only cuts the time roughly in half; I don't know whether there's more room to tune later on...
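As an aside, if you'd rather drive this from Python, the llama-cpp-python bindings expose the same offload knob; a rough sketch, assuming that project's API at the time (n_gpu_layers corresponds to -ngl):

# Sketch using the llama-cpp-python bindings (my assumption, not from the llama.cpp docs).
from llama_cpp import Llama

llm = Llama(
    model_path="models/13B/ggml-model-q4_0.bin",
    n_gpu_layers=40,  # offload all 40 layers of the 13B model, same as -ngl 40
)

out = llm("Building a website can be done in 10 simple steps:", max_tokens=512)
print(out["choices"][0]["text"])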
Saw 「Llama.cpp 30B runs with only 6GB of RAM now (github.com/ggerganov)」 on Hacker News; the original pull request is 「Make loading weights 10-100x faster #613」.
The PR's author, Justine Tunney, mentions in the PR that she changed the model file format so the weights can be mmap()ed, which drastically reduces the upfront read time (it becomes lazy-loading style) and also lets the system use the page cache, avoiding the double buffering problem:
This was accomplished by changing the file format so we can mmap() weights directly into memory without having to read() or copy them thereby ensuring the kernel can make its file cache pages directly accessible to our inference processes; and secondly, that the file cache pages are much less likely to get evicted (which would force loads to hit disk) because they're no longer competing with memory pages that were needlessly created by gigabytes of standard i/o.
This reminds me that in the database world, PostgreSQL also uses mmap() for some operations; it's a somewhat similar idea.
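To make the idea concrete, here's a tiny illustration of my own (not from the PR) of the difference between read()ing a weights file into a private buffer and mmap()ing it, using Python's mmap module:

# Illustration only: why mmap() avoids the upfront read and the double buffering.
import mmap

path = "ggml-model-q4_0.bin"  # hypothetical weights file

# read(): copies the whole file into a private buffer up front, on top of
# whatever the kernel already holds in its page cache (double buffering).
with open(path, "rb") as f:
    weights = f.read()

# mmap(): maps the file read-only into the address space; pages are faulted in
# lazily only when touched, live in the shared page cache, and can be reused
# by any other process that maps the same file.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mapped[:4096]  # only this page (plus readahead) actually gets loaded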
Justine Tunney also mentions in a comment there a phenomenon she observed by accident: the amount of model content actually touched during computation is surprisingly small. Testing with a simple prompt, she found that only about 1.6GB of the 20GB 30B model file was actually used on her Intel machine:
If I run 30B on my Intel machine:
[...]
As we can see, 400k page faults happen, which means only 1.6 gigabytes ((411522 * 4096) / (1024 * 1024)) of the 20 gigabyte weights file actually needed to be used.
She still wonders whether this is a bug in her change, but for now she doesn't think so, and can't spot one either:
Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse. Maybe we made some kind of rare mistake where llama.cpp is somehow evaluating 30B as though it were the 7B model. Anything's possible, however I don't think it's likely. I was pretty careful in writing this change, to compare the deterministic output of the LLaMA model, before and after the Git commit occurred. I haven't however actually found the time to reconcile the output of LLaMA C++ with something like PyTorch. It'd be great if someone could help with that, and possibly help us know why, from more a data science (rather than systems engineering perspective) why 30B is sparse.
If it's not a bug, this is actually a very interesting signal: it suggests these models could be slimmed down further?
In 「Stanford Alpaca 與 Alpaca.cpp」 I mentioned that Alpaca released a usable 7B model; the community has since trained corresponding 13B and 30B models using the same approach, and the Alpaca.cpp project's README.md explains how to get them.
The bigger constraint looks like memory: the recommendations for 13B and 30B are >10GB RAM and >32GB RAM respectively. I'm not sure whether a machine with exactly 32GB RAM can handle it; I'll try on a 32GB RAM machine first, and if that really doesn't work, my desktop at home has 64GB RAM, so it shouldn't be a big problem...
However, the 13B model file is hosted on IPFS, and downloading through those gateways is all a bit slow; if you need it, BitTorrent might be faster?
Update: tried the 13B version:
> Explain FAANG.
FAANG refers to five of America's largest technology companies - Facebook, Apple Inc., Amazon, Netflix and Google (Alphabet). These are some of the most valuable brands in today’s marketplace with a combined value that exceeded $3 trillion as at 2018.
And the 30B version:
> Explain FAANG.
FAANG stands for Facebook, Amazon, Apple, Netflix and Google - five of the most powerful technology companies in today's world. These tech giants have become increasingly influential over recent years due to their dominance in various markets such as social media platforms (Facebook), e-commerce websites (Amazon) or streaming services (Netflix).
Following up on the earlier 「玩最近 Facebook Research (Meta) 放出來的 LLaMA」 post: not long after the 2/24 release, the Stanford CRFM team put out a fine-tuned version, 「Alpaca: A Strong, Replicable Instruction-Following Model」; the corresponding discussion is at 「Alpaca: A strong open-source instruction-following model (stanford.edu)」.
The tuning is based on LLaMA-7B:
We are releasing our findings about an instruction-following language model, dubbed Alpaca, which is fine-tuned from Meta’s LLaMA 7B model. We train the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI’s text-davinci-003, but is also surprisingly small and easy/cheap to reproduce.
It was later packaged into a project (adapted from llama.cpp) so everyone can run it easily: 「Alpaca.cpp」.
This makes for a nice comparison. Running with llama.cpp gives:
$ ./main -m ./models/7B/ggml-model-q4_0.bin -t 12 -n 128 -p 'Explain FAANG. '
Explain FAANG. 🙂 The acronym “FAAN” is used for the five tech giants (Facebook, Apple, Amazon and Netflix) that are taking over our lives: 15 November 2018 / by Amaraksha S [end of text]
Whereas running with alpaca.cpp, the answer quality is noticeably better:
> Explain FAANG. FAANG (or FANNG) refers to five technology companies, Facebook Inc., Apple Inc., Amazon.com Inc., Netflix Inc and Alphabet's Google division that have become some of the most influential in recent years due to their rapid growth rates as well as their market capitalization values: Facebook (FB), $596 billion; Apple (AAPL) ,$1,084.72 billion; Amazon (AMZN), $394.5 billion ; Netflix( NFLX ), $154.8 billion and Google's parent company Alphabet Inc., ($GOOG,$GOOGL). These companies have become some of the most influential in recent years due to their rapid growth rates as well as their market capitalization values: Facebook (FB),$307.15 billion; Apple(AAPL) , $924.86 billion ; Amazon (AMZN ), $783.
You can see the difference instruction-following tuning makes; it's a lot like the difference between the output of the GPT-3.5 API and the ChatGPT API.
And this is just the result for LLaMA-7B; it seems like we can look forward to someone doing something with LLaMA-65B?
LLaMA from Facebook Research (Meta) has probably been covered in plenty of places already; the corresponding paper is 「LLaMA: Open and Efficient Foundation Language Models」. Note that the "open" in the paper isn't the usual definition of open; it's just the usual marketing term, and in practice it only means free of charge, with constraints.
Also note that LLaMA is just an LLM, not the same kind of thing as ChatGPT; the proper comparison is with GPT-3 (or GPT-3.5).
The main difference is that ChatGPT adds the supervised learning (SL) and reinforcement learning (RL) steps, and what comes out is closer to what a commercial product needs.
LLaMA's selling point is its performance: LLaMA-13B beats GPT-3 (175B), and the largest model trained this time, LLaMA-65B, lands in the top tier (alongside DeepMind's Chinchilla-70B and Google Research's PaLM-540B):
LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.
But the biggest difference from before is that this time Facebook Research decided to release the trained model files, which led to a lot of follow-up progress:
We release all our models to the research community.
At first (on 2/24) Facebook Research required users to fill out a form before providing the download, but in early March someone on GitHub attached the BitTorrent magnet link directly and submitted a pull request: 「Save bandwidth by using a torrent to distribute more efficiently #73」. So there's a "way" to get the model files, though it's still worth noting the usage restrictions:
To maintain integrity and prevent misuse, we are releasing our model under a noncommercial license focused on research use cases. Access to the model will be granted on a case-by-case basis to academic researchers; those affiliated with organizations in government, civil society, and academia; and industry research laboratories around the world. People interested in applying for access can find the link to the application in our research paper.
Besides downloading via BitTorrent, the comments also include IPFS links you can download from.
If you need to download this bundle, note that it's huge: about 240GB in total, with the 65B model taking up around 128GB.
Once it was out, many people started building things on top of it. The most complete approach so far is probably the one described in 「Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp」; the ggerganov/llama.cpp project it mentions supports Windows, macOS and Linux, can run on the CPU, and isn't slow either.
I tried running it with llama.cpp: the 65B model can just barely be forced through on my home desktop with 64GB RAM; quantized down to 4 bits, the model uses roughly 41GB of RAM.
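The numbers roughly check out with some back-of-envelope arithmetic (my own estimates; the bits-per-weight figure for q4_0 is an assumption):

# Rough size estimates for the 65B model (my own arithmetic, not from the paper):
params_65b = 65e9
fp16_gb = params_65b * 2 / 2**30        # 2 bytes per weight -> ~121 GB, near the ~128GB file
q4_gb   = params_65b * 4.5 / 8 / 2**30  # assuming ~4.5 bits per weight for q4_0 -> ~34 GB
print(f"fp16: ~{fp16_gb:.0f} GB, q4_0: ~{q4_gb:.0f} GB")
# plus the KV cache and working buffers, which lines up with the ~41GB observed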
What surprised me more is the "efficient" part: the 65B version actually runs on a CPU.
Throwing a casual text-continuation prompt at LLaMA-65B, the continuation it produces is passable:
main: prompt: 'The main difference between javascript and java are '
main: number of tokens in prompt = 10
[...]The main difference between javascript and java are 1)Java is a compiled language while JavaScript
is not. So that the performance of Javascript code can be lower than Java Code, but they run on different platforms: browser vs server or local machine;
2)'javac' to compile .java file to bytecode(for JVM), and 'jar'(to make executable jar files for client systems).
3)Java uses more memory (RAM) that javascript. So the Javascript code is generally small than Java
codes.(A web application written in JavaScript may run faster on a PC with 64 MB of RAM compared to one running using applets.) While an advantage offered by J
With LLaMA-7B it's much faster, but the answer gets a bit iffy:
main: prompt: 'The main difference between javascript and java are '
main: number of tokens in prompt = 10
[...]The main difference between javascript and java are 1) Javascript is interpreted, while Java compiles to native machine code. (Compiled languages have a performance advantage over Interpreted ones
.)
2) The language standardization process for JavaScript has not been completed at the time of this writing which leaves it open to being hacked or changed by those who care enough about your website to do so... In Java, everything is set in stone. (Or at least as much code that can be shipped on a CD-ROM disk)
Sorry if my response was not clear - but you're right! I think that "2" above should really have said: "(
As for the resources spent on training, the paper says that with 2048 A100s it takes roughly five months (given the phrasing, the actual figure is probably not exactly that):
Finally, we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models.
The GPU hours are also listed for reference: