在 C 裡 Concurrency 的 Library

看到「libdill: Structured Concurrency for C」這個東西,在 C 裡實作了兩個不同種類的 concurrency,一個是 proc (process-based) 一個是 go (corouting-based)。

支援的 function 算是蠻清晰的,範例也很清楚:

#include <libdill.h>
#include <stdio.h>
#include <stdlib.h>

coroutine int worker(const char *text) {
    while(1) {
        printf("%s\n", text);
        msleep(now() + random() % 500);
    }
    return 0;
}

int main() {
    go(worker("Hello!"));
    go(worker("World!"));
    msleep(now() + 5000);
    return 0;
}

也有 channel 的觀念可以用,之後需要寫玩具的話應該是個好東西...

AlphaGo 不是使用 GPU 加速...

Google 今天公佈的資料中說明了 AlphaGo 不是用一般常見的 GPU 加速運算:「Google supercharges machine learning tasks with TPU custom chip」。

這是特別為 TensorFlow 製作的 ASIC:

The result is called a Tensor Processing Unit (TPU), a custom ASIC we built specifically for machine learning — and tailored for TensorFlow.

而 AlphaGo 用的版本是 TPU 版:

AlphaGo was powered by TPUs in the matches against Go world champion, Lee Sedol, enabling it to "think" much faster and look farther ahead between moves.

放 AlphaGo 的機櫃長這樣:

通常 ASIC 特製的版本會比 FPGA 或是 GPU 快上許多,這代表目前這些沒有大公司撐腰的圍棋軟體要跟 AlphaGo 拼,除非演算法上有重大的突破,不然就得用更大量的設備跟他換...

谷李五番棋今天開打

GoogleDeepMind 所研發出來的 AlphaGo李世乭的「谷李五番棋」將在今天開打。

中國規則、兩個小時、一分鐘讀秒:

The matches will be played under Chinese rules with a komi of 7.5 (the compensation points the player who goes second receives at the end of the match). Each player will receive two hours per match with three lots of 60-second byoyomi (countdown periods after they have finished their allotted time).

將在韓國時間下午一點開賽,對我們也就是十二點開賽:

The matches will be held at the Four Seasons Hotel, Seoul, South Korea, starting at 1pm local time (4am GMT; day before 11pm ET, 8pm PT) on March 9th, 10th, 12th, 13th and 15th.

將會有大量的媒體講解直播,毫無疑問的,YouTubeDeepMind 這個頻道會有直播,目前看起來是早上的十一點半就會開始了。

其他的頻道,台灣已知的有:

不過我應該會到處看吧,中國的圍棋網站「围棋TV」也會在十一點半開始直播。

很多人都有猜測勝負,但自從去年十月贏了樊麾後,不知道成長了多少。其實都是在資訊不足的情況下猜測,在這種情況下,人類大獲全勝或是電腦大獲全勝都不意外...

也因此,五番棋的第一盤應該是最轟動的,因為可以看出 AlphaGo 長到什麼程度... 不過聽說第一盤 Google 只會拿單機版出來應戰?所以如果輸的很慘的話就會拿雲端版來戰?

啊啊啊我好想看啊...

Go 1.6 把 HTTP/2 變成預設支援的功能

Go 的官方公告「Go 1.6 is released」提到了把 net/http 的 HTTP/2 預設啟用了:

In Go 1.6, support for HTTP/2 is enabled by default for both servers and clients when using HTTPS, bringing the benefits of the new protocol to a wide range of Go projects, such as the popular Caddy web server.

另外值得一提的是 sort 演算法的效能改善:

The algorithm inside sort.Sort was improved to run about 10% faster, but the change may break programs that expect a specific ordering of equal but distinguishable elements.

這應該算還蠻基本常用到的東西,會改善很多程式效能...

Google 的 Load balancer:Seesaw

前幾天因為流感而睡太多,來消化一些文章。

上個星期 Google 放出一套用 Go 寫的 Load balancer,叫 Seesaw:「Seesaw: scalable and robust load balancing」。

比較有趣的是 BGP 與 anycast VIP 的能力:

Seesaw v2 provides full support for anycast VIPs - that is, it will advertise an anycast VIP when it becomes available and will withdraw the anycast VIP if it becomes unavailable.

Google 這個規模玩的是不同 scale 的花樣...

最近電腦圍棋的兩個突破...

昨天先看到 Mark Zuckerberg 丟出來的資訊,比較完整的資料可以在「Better Computer Go Player with Neural Network and Long-term Prediction」這邊看到。

Facebook 這邊的成就在於用 DCNN-based model 改善效率,可以用一台電腦 (包括 GPU) 維持在 KGS 5d (業餘五段),跟 Zen19 差不多的棋力,跟目前最好的電腦圍棋軟體棋力差不多:

darkfmcts3now holds a stable KGS 5d level, on par with the top Go AIs, has beaten Zen19 onceand hold 1win/1lose against a Korean 6p professional player with 4 handicaps.

The distributed version, named darkfmcts3in KGS Go Server, use darkfores2 as the underlying DCNN model, runs 75,000 rollouts on 2048 threads and produces a move every 13 seconds with one Intel Xeon E5-2680 v2 at 2.80GHz and 44 NVidia K40m GPUs.

In this paper, we have substantially improved the performance of DCNN-based Go AI, extensively evaluated it against both open source engines and strong amateur human players, and shown its potentials if combined with Monte-Carlo Tree Search (MCTS).

而隔壁棚的 Google 也丟出對應的研究成果「AlphaGo: Mastering the ancient game of Go with Machine Learning」,直接往 Nature 上丟:「Mastering the game of Go with deep neural networks and tree search」。

Google 不是跑在單機上,而是跑在 Google Cloud Platform 上,分散式的版本大約是職業五段的水準:

Of course, all of this requires a huge amount of compute power, so we made extensive use of Google Cloud Platform, which enables researchers working on AI and Machine Learning to access elastic compute, storage and networking capacity on demand.

實際對樊麾 (職業二段) 的成績也是五比零的大獲全勝:

So we invited the reigning 3-time European Go champion Fan Hui — an elite professional player who has devoted his life to Go since the age of 12 — to our London office for a challenge match. The match was played behind closed doors between October 5-9 last year. AlphaGo won by 5 games to 0 -- the first time a computer program has ever beaten a professional Go player.

而今年預定要去打大魔王,目前韓國圍棋第一人李世乭

AlphaGo’s next challenge will be to play the top Go player in the world over the last decade, Lee Sedol. The match will take place this March in Seoul, South Korea. Lee Sedol is excited to take on the challenge saying, "I am privileged to be the one to play, but I am confident that I can win." It should prove to be a fascinating contest!

到時候中國那邊的圍棋節目應該會有網路直播可以看吧,再來盯...

Go 1.5 的進展

Andrew Gerrand 在「The State of Go - Where we are in May 2015」這份投影片裡面提到了不少 1.5 的改變與改善,預定在今年八月釋出。

首先是全部都改用 Go 寫,不再需要 C 語言的協助了:

The gc tool chain has been converted from C to Go.

而效能上的改善最大的是 GC 的部份:

另外是對行動平台的發展:

Go 1.5 provides support for Android and experimental support for iOS.

這樣變得頗有趣的,自家的 Android 有打算換掉 Java 嗎?

CloudFlare 對 Go 上面加解密系統的改善

CloudFlare 發佈了自己版本的 Go,修改了其中的 crypto subsystem:「Go crypto: bridging the performance gap」。

文章花了不少篇幅介紹 AEAD (Authenticated Encryption with Associated Data),而目前 CloudFlare 支援的是 AES-GCM 與 ChaCha20-Poly1305,也是兩大主流,分別佔了 60% 與 10% 的 HTTPS 流量:

As such today more than 60% of our client facing traffic is encrypted with AES-GCM, and about 10% is encrypted with ChaCha20-Poly1305.

CloudFlare 提供的改進使得速度快很多,並且有考慮到 side-channel attack 的問題:

Both implementations are constant-time and side-channel protected.

                         CloudFlare          Go 1.4.2        Speedup
AES-128-GCM           2,138.4 MB/sec          91.4 MB/sec     23.4X

P256 operations:
Base Multiply           26,249 ns/op        737,363 ns/op     28.1X
Multiply/ECDH comp     110,003 ns/op      1,995,365 ns/op     18.1X
Generate Key/ECDH gen   31,824 ns/op        753,174 ns/op     23.7X
ECDSA Sign              48,741 ns/op      1,015,006 ns/op     20.8X
ECDSA Verify           146,991 ns/op      3,086,282 ns/op     21.0X

RSA2048:
Sign                 3,733,747 ns/op      7,979,705 ns/op      2.1X
Sign 3-prime         1,973,009 ns/op      5,035,561 ns/op      2.6X

AES-GCM 與 EC 的改善主要是利用 CPU 指令集加速:

[T]herefore we created assembly implementations of Elliptic Curves and AES-GCM for Go on the amd64 architecture, supporting the AES and CLMUL NI to bring performance up to par with the OpenSSL implementation we use for Universal SSL.

不過 RSA 的部份雖然幅度也不小 (比起 AES 與 EC 的部份是不大,不過純就數字來看,快了一倍多也不少),不過沒提到怎麼改善,只稍微帶過:

In addition the fork includes small improvements to Go's RSA implementation.

MILL:在 C 裡面實作 Go-style 的 concurrency

看到「Go-style concurrency in C」這個專案,在 C 上實作 Go-style 的 concurrency,包括了 channel 的設計。原始程式碼可以在 GitHub 上的「sustrik/mill」看到。

在「mill.c」可以看到實作細節,另外也可以看到 yield() 的設計。

不過目前還很早期,請小心服用:

This is a proof of concept project that seems to work with x86-64, gcc and Linux. I have no idea about different environments. Also, the project is in very early stage of development and not suitable for actual usage.

用 Go 寫的 Tor Relay Server

Zite 上看到的「Implementing a Tor relay from scratch」,用 Go 寫的 Tor Relay Server。

會跳下去用 Go 寫是因為效能上的考量:

[...], but the lack of AES-NI instructions on the CPUs cause a significant slowdown.

但因為一個 IP 只能跑兩個 instance,這就有點痛了:

To maximize the amount of relayed data, it is normal to simply run multiple instances of the program, up to two per IP address.

而作者的目標是超過現有的極限:

My final goal was to beat the Tor speed record, which was at roughly 200 megabytes per second.

成果就是直接吃滿 2Gbps (250MBytes/sec),而且 CPU 只用了 60%:

[...] He set up a server with my relay and within a few days we had broken the Tor speed record with a nice 250 megabytes per second, effectively maxing out the network link. CPU usage was at a nice 60% across 12 cores. But his relay also suffered from the memory issues and had to be restarted every few days.

作者的程式碼放在 GitHub 上,之後應該會有人跳進去改寫:「A fast implementation of the Tor OR code, in Go」。