GitHub 上有大量重複的程式碼...

扣除掉 fork 的程式碼後,研究人員在 GitHub 上還是發現有大量重複的程式碼:「DéjàVu: a map of code duplicates on GitHub」。

This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 482 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files.

Java/C++/Python/JavaScript 寫的 4.5M 個專案有 482M 個檔案,但只有 85M 個檔案是不一樣的 XD

想一想其實也是... 現在愈來愈多工具產生程式碼了 XD (i.e. Scaffold)

Amazon Athena 可以透過 ODBC 連接了

Amazon Athena 支援 ODBC 了 (先前直接連結只支援 JDBC):「Amazon Athena adds support for querying data using an ODBC driver」。

With the availability of a new ODBC driver, you can now connect popular business intelligence tools to Athena. This allows you to report and visualize all of your data in S3 with the tools of your choice. In addition to the ODBC driver, Customers can now connect to Amazon Athena using a JDBC driver, an API and via the AWS Console.

這讓非 Java 的程式語言可以更方便的接上去了,像是 PHPPDO 支援 ODBC 但不支援 JDBC,要用就得想其他辦法:「PHP: PDO Drivers - Manual」。

IBM 把自家的 JVM 貢獻出來:Eclipse OpenJ9

IBM 把自家的 JVM 貢獻出來,與 Eclipse 合作,一起推出了 OpenJ9 (原名 IBM J9)。程式碼可以在 GitHub 上的 eclipse/openj9 取得。

在官網上有提到與官方版本不同的特性:

Low memory footprint. Fast start-up time. High application throughput. Optimized to run Java applications cost-effectively in the cloud.

應該會有人包 PPA 吧,之後跑 Java 程式可以拿來用看看...

zetcd:用 etcd 跑 ZooKeeper 架構

在「zetcd: running ZooKeeper apps without ZooKeeper」這邊介紹了用 etcd 當作 ZooKeeper 伺服器。程式碼在「Serve the Apache Zookeeper API but back it with an etcd cluster」這邊可以看到。

不過可以看到有不少 overhead:

但 etcd 用 Go 寫 (省下 JVM tuning?),可能是個不錯的誘因...

印 "#" 比印 "B" 來的快的問題

這篇是兩年前在 StackOverflow 上的問題:「Why is printing “B” dramatically slower than printing “#”?」。

問問題的人這段程式跑了 8.52 秒:

Random r = new Random();
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 1000; j++) {
        if(r.nextInt(4) == 0) {
            System.out.print("O");
        } else {
            System.out.print("#");
        }
    }

   System.out.println("");
 }

而把上面的 # 換成 B 就變成 259.152 秒。

答案是與 word-wrapping 有關:

Pure speculation is that you're using a terminal that attempts to do word-wrapping rather than character-wrapping, and treats B as a word character but # as a non-word character. So when it reaches the end of a line and searches for a place to break the line, it sees a # almost immediately and happily breaks there; whereas with the B, it has to keep searching for longer, and may have more text to wrap (which may be expensive on some terminals, e.g., outputting backspaces, then outputting spaces to overwrite the letters being wrapped).

But that's pure speculation.

這真是細節 XDDD

商業版本的 Zing JVM 對 GC 的改善

在翻「Stuff The Internet Says On Scalability For February 19th, 2016」這邊的資料時看到這篇文章講到 Zing JVM 大幅降低了他們在 C10M 時遇到的 latency 問題:「Fast C10M: MigratoryData running on Zing JVM achieves near 1 Gbps messaging to 10 million concurrent users with 15 millisecond consistent latency」。

包括了平均值、99% 值、最大值都大幅下降:

In this post, we show that by simply replacing the JVM with Zing JVM out-of-the-box (without any tuning), and preserving the same C10M benchmark scenario and setup, we can reduce the average latency from 61 milliseconds to under 15 milliseconds. Moreover, and more importantly, the latency spikes can be significantly reduced from 585 milliseconds to 25 milliseconds for the 99th percentile latency and from 1700 milliseconds to 126 milliseconds for the maximum latency. Therefore, every single message can be delivered, even in the worst case, with almost no delay.

而且他們發現最高的 126ms 也不是 GC 造成的,而是 benchmark 這邊造成的:

And so, the relatively high latency spikes we saw in the previous C10M benchmark were due to JVM’s Garbage Collection (GC). In the new benchmark, not only Zing JVM didn’t introduce high latency spikes, but based on analyzing the logs, it appears that GC effects no longer dominate latency behavior. The dramatically improved 126 ms max latency is not caused by GC but by other condition of the benchmark setup. Anyway, this max latency was so small for a web architecture that we did not spend time to determine at which level exactly it occurred.

Zing 的價錢是一台 USD$8000/year,另外也有虛擬機版本 (另外報價),如果遇到 JVM GC 問題,看起來會是個可以花錢解決問題的方案:

Zing® is priced on a subscription basis per server. With per-server pricing, you don’t need to worry about core counts, memory size, or number of instances deployed per server. The annualized subscription price for Zing per physical server ranges from $8000 (for a single license) to under $2000 (for orders above 1000 servers). Higher volumes and longer subscription terms will reduce the per-server price for Zing. Pricing for virtual servers is also available upon request.

ScyllaDB:用 C++ 改寫相容於 Cassandra 的系統

Scylla 是出自希臘神話,維基百科對應的連結:「斯庫拉」、「Scylla」。而在 ScyllaDB 官網副標題寫著:

Fully compatible with Apache Cassandra at 10x the throughput and jaw dropping low latency

JVM 的 GC 老問題在 Cassandra 中帶來的 latency 不穩定本來就是個痛苦的問題,要花很多力氣去調整,而用 C++ 改寫等於是自己處理這一塊。

這帶來的效能提昇可以從各種測試結果看出來,像是單機的測試:「Scylla vs. Cassandra benchmark」,以及多機的測試:「Scylla vs. Cassandra benchmark (cluster)」(可以參考下圖)。

而 Latency 的改善也是極為明顯:「Latency benchmark」。

其中另外一個重要的技術是 IntelDPDK,可以大幅降低現有 Linux Kernel 在網路架構上的損耗:「Dedicated fast network stack for modern hardware」。

很有趣的專案,好久沒碰 Cassandra 了...

GitHub 上程式語言的趨勢

GitHub 給了從 2008 年到 2015 年現在,放在 GitHub 上專案所使用程式語言的排名:「Language Trends on GitHub」。

這同時包括了公開與私人 repository:

The rank represents languages used in public & private repositories, excluding forks, as detected by Linguist.

可以看到 Java 專案的排名逐步上升,應該是愈來愈多 Java 專案放到 GitHub 上 (應該是跟 Android 有關)。而 Perl 是掉出去很久了,PHP 則是萬年不動... XD

Facebook 推出靜態分析工具:Facebook Infer

Facebook 推出了靜態分析工具 Facebook Infer,可以事先找出 AndroidiOS 上的 bug:Open-sourcing Facebook Infer: Identify bugs before you ship

從官方給的操作動畫中就可以看出來怎麼跑了。目前看起來支援三種程式語言,C、Objective-C、Java:

Facebook Infer is a static analysis tool - if you give Infer some Objective-C, Java, or C code, it produces a list of potential bugs.

在 Android 上 (Java) 會找出的類型:

Infer reports null pointer exceptions and resource leaks in Android and Java code.

iOS 上則只找 memory leak:

In addition to this, it reports memory leak problems in iOS and C code.

比較特別的是,這個工具是用 OCaml 寫:

Infer is a static analysis tool for Java, Objective-C and C, written in OCaml.