把 HDFS 的效能瓶頸 metadata server 的資料改到 NewSQL 上使得效能大幅提昇:「HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases」。
In this paper, we introduce HopsFS, a next generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS’ single node in-memory metadata service, with a distributed metadata service built on a NewSQL database.
尤其是在 Spotify 的測試,有 16~37 倍的改善 (應該是指碰到 HDFS 時的這塊,像是從外部拉到 HDFS 上的分析,而非整體的效率改善):
Metadata capacity has been increased to at least 37 times HDFS’ capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS.
論文裡面有提到用的是 MySQL Cluster 的 NDB (in-memory):
HopsFS stores all metadata normalized in a highly available, in-memory, distributed, relational database called Network Database (NDB), a NewSQL storage engine for MySQL Cluster.
這樣應該會讓 Hadoop 的人有改善方向...