The trn1.* instances, built on AWS's in-house chips, are now available: "Amazon EC2 Trn1 Instances for High-Performance Model Training are Now Available".
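Until now, among the three big cloud vendors only Google Cloud Platform had its own chip, the TPU, for both training and evaluation. With AWS Trainium, AWS fills in the training side of its lineup. The official claim is up to 50% lower compute cost than GPU-based instances: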
Trainium-based EC2 Trn1 instances solve this challenge by delivering faster time-to-train while offering up to 50% cost-to-train savings over comparable GPU-based instances.
PyTorch and TensorFlow are both supported:
The Neuron plugin natively integrates with popular ML frameworks, such as PyTorch and TensorFlow.
You can also use neuron-ls to see information about the Neuron devices, though I don't see why the private IP information has to be masked.
Large clusters are served through integration with Amazon FSx for Lustre:
For large-scale model training, Trn1 instances integrate with Amazon FSx for Lustre high-performance storage and are deployed in EC2 UltraClusters. EC2 UltraClusters are hyperscale clusters interconnected with a non-blocking petabit-scale network.
The first wave of regions is a bit thin, though: just the usual suspects, us-east-1 (N. Virginia) and us-west-2 (Oregon):
You can launch Trn1 instances today in the AWS US East (N. Virginia) and US West (Oregon) Regions as On-Demand, Reserved, and Spot Instances or as part of a Savings Plan.
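In us-east-1, trn1.2xlarge is priced at US$1.34375/hr, but without actually running a workload and comparing, there doesn't seem to be a way to judge whether it is any good...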
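Once you do have measured runtimes, the comparison itself is simple arithmetic. Here is a rough sketch in Python. The only number taken from above is the US$1.34375/hr on-demand rate; the function names, the GPU hourly rate, and the job durations are all hypothetical placeholders, not benchmark results.

```python
# Rough cost-to-train comparison. Only the trn1.2xlarge rate comes from the
# post (us-east-1, on-demand); every other number here is a made-up placeholder.
TRN1_2XLARGE_USD_PER_HOUR = 1.34375


def cost_to_train(hours: float, usd_per_hour: float = TRN1_2XLARGE_USD_PER_HOUR) -> float:
    """Total on-demand cost of a job that runs for `hours`."""
    return hours * usd_per_hour


def savings_vs_gpu(trn1_hours: float, gpu_hours: float, gpu_usd_per_hour: float) -> float:
    """Fraction saved by running on trn1.2xlarge instead of a GPU instance.

    Positive means trn1 is cheaper; negative means it is more expensive.
    """
    trn1_cost = cost_to_train(trn1_hours)
    gpu_cost = cost_to_train(gpu_hours, gpu_usd_per_hour)
    return 1.0 - trn1_cost / gpu_cost


if __name__ == "__main__":
    # Hypothetical: the same job takes 12 h on trn1.2xlarge and 10 h on a
    # GPU instance that costs US$3.00/hr.
    print(f"trn1 cost: US${cost_to_train(12):.3f}")
    print(f"savings:   {savings_vs_gpu(12, 10, 3.0):.2%}")
```

The point is that the hourly price alone says nothing. The savings depend on how long the same job takes on each instance type, which is exactly the part that has to be measured.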
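Still, at least AWS has finally put a product on the table to compete. After all, you have to be big enough to get this kind of silicon custom-made.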