The first serious run of KataGo ran for 7 days in Februrary 2019 on up to 35xV100 GPUs. This is the run featured in the paper. It achieved close to LZ130 strength before it was halted, or up to just barely superhuman.
Following some further improvements and much-improved hyperparameters, KataGo performed a second serious run in May-June a max of 28xV100 GPUs, surpassing the February run after just three and a half days. The run was halted after 19 days, with the final 20-block networks reaching a final strength slightly stronger than LZ-ELFv2! (This is Facebook's very strong 20-block ELF network, running on Leela Zero's search architecture). Comparing to the yet larger Leela Zero 40-block networks, KataGo's network falls somewhere around LZ200 at visit parity, despite only itself being 20 blocks.
從論文裡面可以看到,跟 Leela Zero 一樣是逐步提昇 (應該也是用 Net2Net),而不是一開始就拉到 20x256:
In KataGo’s main 19-day run, (b, c) began at (6, 96) and switched to (10, 128), (15, 192), and (20, 256), at roughly 0.75 days, 1.75 days, and 7.5 days, respectively. The final size approximately matches that of AlphaZero and ELF.
訓練速度上會有這麼大的改善,分成兩個類型,一種是一般性的 (在「Major General Improvements」這章),另外一類是特定於圍棋領域的改進 (在「Major Domain-Specific Improvements」這章)。
在 Leela Zero 的 issue tracking 裡面也可以看到很多關於 KataGo 的消息,看起來作者也在裡面一起討論,應該會有一些結果出來...
沒有太意外是使用 Leela Zero + Lizzle,畢竟這是 open source project,在軟體與資料的取得上相當方便,而且在好的硬體上已經可以超越人類頂尖棋手。
由於在 Lizzle 的介面上可以看到勝率,以及 Leela Zero 考慮的下一手 (通常會有多個選點),而且當游標移到這些選點上以後,還會有可能的變化圖可以看,所以對於棋手在熟悉操作介面後,可以很快的擺個變化圖,然後讓 Leela Zero 分析後續的發展,而棋手就可以快速判斷出「喔喔原來是這樣啊」。
KAISER will affect performance for anything that does system calls or interrupts: everything. Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt. Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical. The worst we have seen is a roughly 30% regression on a loopback networking test that did a ton of syscalls and context switches.
With these VM results so far it's still a far cry from the "30%" performance hit that's been hyped up by some of the Windows publications, etc. It's still highly dependent upon the particular workload and system how much performance may be potentially lost when enabling page table isolation within the kernel.
If you're using Linux and you configure udev to grant access to the vendor ID & product ID of the token as it appears normally, nothing will work because the vendor ID and product ID are different when it's active. The Chrome extension will get very confused about this.
ASN.1 DER is designed to be a “distinguished” encoding, i.e. there should be a unique serialisation for a given value and all other representations are invalid. As such, numbers are supposed to be encoded minimally, with no leading zeros (unless necessary to make a number positive). Feitian doesn't get that right with this security key: numbers that start with 9 leading zero bits have an invalid zero byte at the beginning. Presumably, numbers starting with 17 zero bits have two invalid zero bytes at the beginning and so on, but I wasn't able to press the button enough times to get such an example. Thus something like one in 256 signatures produced by this security key are invalid.
With this device, I can't test things like key handle mutability and whether the appID is being checked because of some odd behaviour. The response to the first Check is invalid, according to the spec: it returns status 0x9000 (“NO_ERROR”), when it should be 0x6985 or 0x6a80. After that, it starts rejecting all key handles (even valid ones) with 0x6a80 until it's unplugged and reinserted.
This device has the same non-minimal signature encoding issue as the Feitian ePass. Also, if you click too fast, this security key gets upset and rejects a few requests with status 0x6ffe.
A 1KiB ping message crashes this device (i.e. it stops responding to USB messages and needs to be unplugged and reinserted). Testing a corrupted key handle also crashes it and thus I wasn't able to run many tests.
The Key-ID (and HyperFIDO devices, which have the same firmware, I think) have the same non-minimal encoding issue as the Feitian ePass, but also have a second ASN.1 flaw. In ASN.1 DER, if the most-significant bit of a number is set, that number is negative. If it's not supposed to be negative, then a zero pad byte is needed. I think what happened here is that, when testing the most-significant bit, the security key checks whether the first byte is > 0x80, but it should be checking whether it's >= 0x80. The upshot is the sometimes it produces signatures that contain negative numbers and are thus invalid.
It turned out that in some unusual circumstances, which I’ll detail below, our edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data.
不過這邊不包括 SSL 的 key,主要是因為隔離開了:
For the avoidance of doubt, Cloudflare customer SSL private keys were not leaked. Cloudflare has always terminated SSL connections through an isolated instance of NGINX that was not affected by this bug.
不過由於這些敏感資料甚至還被 Google 收進 search engine,算是相當的嚴重,所以不只是 Cloudflare 得修好這個問題,還得跟眾多的 search engine 合作將這些資料移除:
Because of the seriousness of such a bug, a cross-functional team from software engineering, infosec and operations formed in San Francisco and London to fully understand the underlying cause, to understand the effect of the memory leakage, and to work with Google and other search engines to remove any cached HTTP responses.
bug 影響的時間從 2016/09/22 開始:
2016-09-22 Automatic HTTP Rewrites enabled 2017-01-30 Server-Side Excludes migrated to new parser 2017-02-13 Email Obfuscation partially migrated to new parser 2017-02-18 Google reports problem to Cloudflare and leak is stopped
The greatest period of impact was from February 13 and February 18 with around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulting in memory leakage (that’s about 0.00003% of requests).
I worked with cloudflare over the weekend to help clean up where I could. I've verified that the original reproduction steps I sent cloudflare no longer work.
事發後完整的時間軸:
2017-02-18 0011 Tweet from Tavis Ormandy asking for Cloudflare contact information 2017-02-18 0032 Cloudflare receives details of bug from Google 2017-02-18 0040 Cross functional team assembles in San Francisco 2017-02-18 0119 Email Obfuscation disabled worldwide 2017-02-18 0122 London team joins 2017-02-18 0424 Automatic HTTPS Rewrites disabled worldwide 2017-02-18 0722 Patch implementing kill switch for cf-html parser deployed worldwide 2017-02-20 2159 SAFE_CHAR fix deployed globally 2017-02-21 1803 Automatic HTTPS Rewrites, Server-Side Excludes and Email Obfuscation re-enabled worldwide