Georgi Gerganov wrote a guide on running llama.cpp on an AWS GPU instance: "Using llama.cpp with AWS instances #4225".
Jumping straight to the lazy-mode part at the end, he provides a shell script that does all the setup for you:
bash -c "$(curl -s https://ggml.ai/server-llm.sh)"
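As with any curl-pipe-to-shell one-liner, you may want to see what it does before running it; a minimal sketch of the more cautious route:

# download the script first so it can be reviewed before execution
curl -s https://ggml.ai/server-llm.sh -o server-llm.sh
less server-llm.sh
bash server-llm.sh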
Back to the beginning: for the machine choice, he picked the cheapest option with 4 vCPU + 16GB RAM + 16GB VRAM.
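Those specs line up with a g4dn.xlarge (4 vCPU, 16GB RAM, and a T4 with 16GB VRAM), though the instance type isn't named here, so treat that as an assumption; a sketch of launching such a box with the AWS CLI, with the AMI, key pair, and security group IDs as placeholders you would fill in yourself:

# launch a GPU instance; g4dn.xlarge is an assumption based on the specs above,
# and the AMI / key / security group values are placeholders
aws ec2 run-instances \
    --instance-type g4dn.xlarge \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --key-name my-key \
    --security-group-ids sg-xxxxxxxx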
He then mentioned that the OpenHermes-2.5-Mistral-7B model has been gaining traction lately and might be worth a look:
We have just 16GB VRAM to work with, so we likely want to choose a 7B model. Lately, the OpenHermes-2.5-Mistral-7B model is getting some traction so let's go with it.
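The server command below points at a locally quantized GGUF file; a sketch of one way to produce it with llama.cpp's own tooling, assuming the weights come from the teknium/OpenHermes-2.5-Mistral-7B repository on Hugging Face (the paths match the command below, but the exact steps in his guide may differ):

# fetch the original weights (needs git-lfs), then convert and quantize to q4_k
git clone https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B models/openhermes-7b-v2.5
python3 convert.py models/openhermes-7b-v2.5/
./quantize models/openhermes-7b-v2.5/ggml-model-f16.gguf \
           models/openhermes-7b-v2.5/ggml-model-q4_k.gguf q4_k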
Then use the server program that ships with llama.cpp to bring up an API server:
./server -m models/openhermes-7b-v2.5/ggml-model-q4_k.gguf --port 8888 --host 0.0.0.0 --ctx-size 10240 --parallel 4 -ngl 99 -n 512
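A quick read of those flags: -m points at the quantized model, --ctx-size 10240 combined with --parallel 4 gives four concurrent request slots of 2560 context tokens each, -ngl 99 offloads effectively all layers to the GPU (a 7B model has far fewer than 99), and -n 512 caps each generated reply at 512 tokens.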
After that you can test it with cURL:
curl -s http://XXX.XXX.XXX.XXX:8888/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
            {
                "role": "system",
                "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
            },
            {
                "role": "user",
                "content": "Write a limerick about python exceptions"
            }
        ]
    }' | jq
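Since the endpoint mimics OpenAI's chat completions API, the response comes back in the familiar choices array; if you only want the generated text, a small variation on the jq filter pulls it out (same placeholder address as above):

curl -s http://XXX.XXX.XXX.XXX:8888/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Write a limerick about python exceptions"}]}' \
    | jq -r '.choices[0].message.content'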
It's all packaged up for you...