While JSON mode improves model reliability for generating valid JSON outputs, it does not guarantee that the model’s response will conform to a particular schema.
而現在新的 model 可以了:
Today we’re introducing Structured Outputs in the API, a new feature designed to ensure model-generated outputs will exactly match JSON Schemas provided by developers.
新的 model 代碼是 gpt-4o-2024-08-06 這組,而且又降價了:
By switching to the new gpt-4o-2024-08-06, developers save 50% on inputs ($2.50/1M input tokens) and 33% on outputs ($10.00/1M output tokens) compared to gpt-4o-2024-05-13.
This release adds a new human supervised learning ("Human SL") model trained on a large number of human games to predict human moves across players of different ranks and time periods! Not much experimentation with it has been done yet and there is probably low-hanging fruit on ways to use and visualize it, open for interested devs and enthusiasts to try.
從 NVIDIA 這邊的新聞稿列出來的則比較合理,是透過硬體的觀點提到這個 12B model 可以跑在一張 4090 上 (24GB VRAM):
Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.
不過即使可以這樣跑,目前比較有效率的跑法應該是應該都會找 quantization 版本來跑,通常 model 會變小不少,而且損失應該也還在能接受的範圍。
Concretely, Mixtral has 46.7B total parameters but only uses 12.9B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12.9B model.
If you use zsh and have trouble running llamafile, try saying sh -c ./llamafile. This is due to a bug that was fixed in zsh 5.9+. The same is the case for Python subprocess, old versions of Fish, etc.
On Linux, Nvidia cuBLAS GPU support will be compiled on the fly if (1) you have the cc compiler installed, (2) you pass the --n-gpu-layers 35 flag (or whatever value is appropriate) to enable GPU, and (3) the CUDA developer toolkit is installed on your machine and the nvcc compiler is on your path.
但可以看到沒有被 offload 到 GPU 上面:
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 4165.47 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
嘗試了不同的方法,發現要跑 sh -c "./llamafile --n-gpu-layers 35",也就是把參數一起包進去,這樣就會出現對應的 offload 資訊,而且輸出也快很多:
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 70.42 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MB
We have just 16GB VRAM to work with, so we likely want to choose a 7B model. Lately, the OpenHermes-2.5-Mistral-7B model is getting some traction so let's go with it.
curl -s http://XXX.XXX.XXX.XXX:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
"role": "user",
"content": "Write a limerick about python exceptions"
}
]
}' | jq