Scanning a book's contents without opening it

MIT Media Lab has built something fun: a way to scan the contents of a book without opening it. See "Can computers read through a book page by page without opening it?"; the paper itself is titled "Terahertz time-gated spectral imaging for content extraction through layered structures".

It scans with electromagnetic waves from 100 GHz to 3 THz:

In our new study we explore a range of frequencies from 100 Gigahertz to 3 Terahertz (THz) which can penetrate through paper and many other materials.
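To get a feel for that frequency range, here is a small sketch that converts the two endpoints into free-space wavelengths using λ = c / f. The function name `wavelength_mm` is just something I made up for illustration; it is not from the paper.

```python
# Convert the quoted frequency range into free-space wavelengths (lambda = c / f).
# wavelength_mm is a hypothetical helper name, not anything from the paper.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def wavelength_mm(freq_hz: float) -> float:
    """Return the free-space wavelength in millimetres for a frequency in Hz."""
    return SPEED_OF_LIGHT_M_PER_S / freq_hz * 1000.0


for label, freq in (("100 GHz", 100e9), ("3 THz", 3e12)):
    print(f"{label}: {wavelength_mm(freq):.4f} mm")
```

So the range runs from about 3 mm down to about 0.1 mm.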

There have been similar attempts before using X-ray or ultrasound, but neither worked well:

Can’t X-ray or ultrasound do this? It may seem that X-ray or ultrasound can also image through a book; however, such techniques lack the contrast of our THz approach for submicron pen or pencil layers compared next to blank paper. These methods have additional drawbacks like cost and ionizing radiation. So while you might be able to hardly detect pages of a closed book if you use a CT scan, you will not be able to see the text. Ultrasound does not have the resolution to detect 20 micron gaps in between the pages of a closed book -distinguishing the ink layers from the blank paper is out of the question for ultrasound. Based on the paper absorption spectrum, we believe that far infrared time resolved systems and THz time domain systems might be the only suitable candidates for investigating paper stacks page by page.

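No idea how far this can be pushed. For now the output is only at the "just about readable" level, and the quality still looks insufficient.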

PostgreSQL's improvement to vacuum performance

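"No More Full-Table Vacuums" describes a big improvement in PostgreSQL's vacuum performance. Until now, vacuum on a large database periodically had to scan whole tables from start to finish to keep transaction IDs correct: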

Current releases of PostgreSQL need to read every page in the database at least once every 2 billion write transactions (less, with default settings) to verify that there are no old transaction IDs on that page which require "freezing".

On machines holding a lot of data, this eats a large amount of resources and causes all sorts of annoying symptoms:

All of a sudden, when the number of transaction IDs that have been consumed crosses some threshold, autovacuum begins processing one or more tables, reading every page. This consumes much more I/O bandwidth, and exerts much more cache pressure on the system, than a standard vacuum, which reads only recently-modified page.

The author submitted a patch so that it only reads the parts that have not been taken care of yet:

Instead of whole-table vacuums, we now have aggressive vacuums, which will read every page in the table that isn't already known to be entirely frozen.

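Note that an aggressive vacuum still consumes far more resources than a regular vacuum, but the cost gets spread out (a bit like comparing one big GC pause that causes lag with many minor GCs that keep response times smoother):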

An aggressive vacuum still figures to read more data than a regular vacuum, possibly a lot more. But at least it won't read the data that hasn't been touched since the last aggressive vacuum, and that's a big improvement.
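The difference between the three behaviours is easier to see with a toy model. The sketch below is not PostgreSQL code: the `Page` class, its `all_visible` / `all_frozen` flags, and the mode names are my own labels for the ideas in the quotes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    all_visible: bool  # assumption: page has not been modified recently
    all_frozen: bool   # assumption: page is known to be entirely frozen


def pages_to_read(pages, mode):
    """Return the indices of the pages a vacuum in this mode would read (toy model)."""
    if mode == "full_table":   # old behaviour: read every page
        return list(range(len(pages)))
    if mode == "aggressive":   # new behaviour: skip pages known to be entirely frozen
        return [i for i, p in enumerate(pages) if not p.all_frozen]
    if mode == "regular":      # standard vacuum: only recently modified pages
        return [i for i, p in enumerate(pages) if not p.all_visible]
    raise ValueError(f"unknown mode: {mode}")


pages = (
    [Page(all_visible=True, all_frozen=True)] * 6     # old, untouched, already frozen
    + [Page(all_visible=True, all_frozen=False)] * 3  # quiet, but not frozen yet
    + [Page(all_visible=False, all_frozen=False)] * 1 # recently modified
)

counts = {m: len(pages_to_read(pages, m)) for m in ("full_table", "aggressive", "regular")}
print(counts)
```

With this made-up table of 10 pages, the old full-table pass reads all 10, the aggressive vacuum reads 4, and a regular vacuum reads 1, which matches the "more than regular, but much less than everything" description.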

This feature is slated for PostgreSQL 9.6; not sure whether it will be on by default...

PostgreSQL 9.6 will have Parallel Sequential Scan

According to "Parallel Sequential Scan is Committed!", PostgreSQL 9.6 (not yet released) will have Parallel Sequential Scan.

The author demonstrates it with a query that everyone abuses all the time, the classic LIKE '%word%':

rhaas=# \timing
Timing is on.
rhaas=# select * from pgbench_accounts where filler like '%a%';
 aid | bid | abalance | filler
-----+-----+----------+--------
(0 rows)

Time: 743.061 ms
rhaas=# set max_parallel_degree = 4;
SET
Time: 0.270 ms
rhaas=# select * from pgbench_accounts where filler like '%a%';
 aid | bid | abalance | filler
-----+-----+----------+--------
(0 rows)

Time: 213.412 ms
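The timings above work out to roughly a 3.5x speedup (743.061 ms / 213.412 ms) with a parallel degree of 4. The general idea of splitting one sequential scan across workers can be sketched like this. It is only an illustration of the partition-and-filter concept, not how PostgreSQL implements it, and `scan_chunk` / `parallel_seq_scan` are hypothetical names. Python threads will not give a real speedup for this CPU-bound filter; the point is only the shape of the work split.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(rows, needle):
    """Sequentially filter one chunk, like LIKE '%needle%' on the filler column."""
    return [row for row in rows if needle in row["filler"]]


def parallel_seq_scan(rows, needle, workers=4):
    """Split the rows into contiguous chunks and scan each chunk in its own worker."""
    size = max(1, -(-len(rows) // workers))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(scan_chunk, chunks, [needle] * len(chunks)))
    # Concatenate the per-worker results, keeping the original row order.
    return [row for part in parts for row in part]


# Made-up rows standing in for pgbench_accounts.
table = [{"aid": i, "filler": "a" if i % 3 == 0 else "x"} for i in range(10)]
print(parallel_seq_scan(table, "a"))
```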

Really nice feature XD

Google's book-scanning service ruled to be "fair use"

Google's book-scanning service has been ruled fair use: "Google's Book-Scanning Project Ruled to Be Legal `Fair Use'".

“Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality and display of snippets from those works are non-infringing fair uses,” U.S. Circuit Judge Pierre Leval wrote on behalf of the court. “The purpose of the copying is highly transformative, the public display of text is limited and the revelations do not provide a significant market substitute for the protected aspects of the originals.”

Looks like it has been fought all the way up to the Second Circuit Court of Appeals? (The one that covers the New York area.)

Checking a system from the command line with CipherScan

"SSL/TLS for the Pragmatic" mentions a tool called CipherScan, which is simple to use and gives very clear output.

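Just git clone it and run it. Because testing ChaCha20+Poly1305 needs a newer OpenSSL (only available in 1.0.2, which is still a development version), the repository ships with a Linux build of openssl; if you delete it, the script falls back to the system openssl.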

For example, scanning my blog gives this result:

gslin@home [~/git/cipherscan] [17:57/W4] (master) ./cipherscan blog.gslin.org:443
........................
Target: blog.gslin.org:443

prio  ciphersuite                  protocols              pfs_keysize
1     DHE-RSA-AES256-GCM-SHA384    TLSv1.2                DH,2048bits
2     ECDHE-RSA-AES256-GCM-SHA384  TLSv1.2                ECDH,P-256,256bits
3     ECDHE-RSA-AES256-SHA384      TLSv1.2                ECDH,P-256,256bits
4     ECDHE-RSA-AES256-SHA         TLSv1,TLSv1.1,TLSv1.2  ECDH,P-256,256bits
5     DHE-RSA-AES256-SHA256        TLSv1.2                DH,2048bits
6     DHE-RSA-AES256-SHA           TLSv1,TLSv1.1,TLSv1.2  DH,2048bits
7     DHE-RSA-CAMELLIA256-SHA      TLSv1,TLSv1.1,TLSv1.2  DH,2048bits
8     AES256-GCM-SHA384            TLSv1.2
9     AES256-SHA256                TLSv1.2
10    AES256-SHA                   TLSv1,TLSv1.1,TLSv1.2
11    CAMELLIA256-SHA              TLSv1,TLSv1.1,TLSv1.2
12    ECDHE-RSA-AES128-GCM-SHA256  TLSv1.2                ECDH,P-256,256bits
13    ECDHE-RSA-AES128-SHA256      TLSv1.2                ECDH,P-256,256bits
14    ECDHE-RSA-AES128-SHA         TLSv1,TLSv1.1,TLSv1.2  ECDH,P-256,256bits
15    DHE-RSA-AES128-GCM-SHA256    TLSv1.2                DH,2048bits
16    DHE-RSA-AES128-SHA256        TLSv1.2                DH,2048bits
17    DHE-RSA-AES128-SHA           TLSv1,TLSv1.1,TLSv1.2  DH,2048bits
18    DHE-RSA-CAMELLIA128-SHA      TLSv1,TLSv1.1,TLSv1.2  DH,2048bits
19    AES128-GCM-SHA256            TLSv1.2
20    AES128-SHA256                TLSv1.2
21    AES128-SHA                   TLSv1,TLSv1.1,TLSv1.2
22    CAMELLIA128-SHA              TLSv1,TLSv1.1,TLSv1.2
23    DES-CBC3-SHA                 TLSv1,TLSv1.1,TLSv1.2

Certificate: trusted, 2048 bit, sha256WithRSAEncryption signature
TLS ticket lifetime hint: 600
OCSP stapling: supported
Server side cipher ordering
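Since the output is a plain whitespace-separated table, it is easy to post-process. Here is a minimal sketch that parses a few rows copied from the scan above and lists the suites with nothing in the pfs_keysize column. `parse_row` and the dictionary keys are names I picked for this example; CipherScan itself does not provide them.

```python
# A few rows copied from the scan output above.
SAMPLE = """\
1     DHE-RSA-AES256-GCM-SHA384    TLSv1.2                DH,2048bits
4     ECDHE-RSA-AES256-SHA         TLSv1,TLSv1.1,TLSv1.2  ECDH,P-256,256bits
8     AES256-GCM-SHA384            TLSv1.2
23    DES-CBC3-SHA                 TLSv1,TLSv1.1,TLSv1.2
"""


def parse_row(line):
    """Split one table row into priority, cipher suite, protocols and PFS info."""
    fields = line.split()
    return {
        "prio": int(fields[0]),
        "cipher": fields[1],
        "protocols": fields[2].split(","),
        # The pfs_keysize column is empty for suites without forward secrecy.
        "pfs": fields[3] if len(fields) > 3 else None,
    }


rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
no_pfs = [row["cipher"] for row in rows if row["pfs"] is None]
print(no_pfs)
```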

Facebook's InnoDB patches make table scans faster...

Facebook's Database Engineering team implemented patches that greatly speed up table scans in InnoDB: "Making full table scan 10x faster in InnoDB".

The first patch is called Logical Readahead. The second patch improves async I/O (Submitting multiple async I/O requests at once).

A few quotes from the article show how powerful these patches are:

Logical backup size is much smaller. 3x-10x size difference is not uncommon.

The backups come out smaller, and they claim that 1/3 to 1/10 of the size is not unusual... -_-

With logical readahead, our full table scan speed improved 9~10 times than before under usual production workloads. Under heavy production workloads, full table scan speed became 15~20 times faster.

And table scans get much faster... 10x? And it is even more noticeable on a database that is already heavily loaded?

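If these patches turn out to have no problems, they can be expected to be merged into Percona and MariaDB. As for Oracle's official source tree... great if it happens, but perfectly normal if it does not?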

Scanning port 22 across the whole Internet...

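People usually scan ports 80/443, so someone went and scanned port 22: "We scanned the Internet for port 22". According to the original post, the numbers given this time cover only 60% of the Internet; the data for the other 40% is problematic, and he will try to fix it...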

These are the top 20 unique banners:

 1730887 SSH-2.0-OpenSSH_4.3
 1562709 SSH-2.0-OpenSSH_5.3
 1067097 SSH-2.0-dropbear_0.46
  824377 SSH-2.0-dropbear_0.51
  483318 SSH-2.0-dropbear_0.52
  348878 SSH-2.0-OpenSSH_5.9p1 Debian-5ubuntu1
  327841 SSH-1.99-Cisco-1.25
  320539 SSH-2.0-OpenSSH_5.5p1 Debian-6+squeeze3
  318279 SSH-2.0-OpenSSH_5.9p1 Debian-5ubuntu1.1
  307028 SSH-2.0-ROSSSH
  271614 SSH-2.0-OpenSSH_5.5p1 Debian-6+squeeze2
  233842 SSH-2.0-OpenSSH_5.1p1 Debian-5
  225095 SSH-2.0-OpenSSH_5.1
  224991 SSH-2.0-OpenSSH_5.1p1 FreeBSD-20080901
  213201 SSH-2.0-OpenSSH_4.7
  209023 SSH-2.0-OpenSSH_6.0p1 Debian-4
  195977 SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu7
  140809 SSH-2.0-dropbear_0.50
  135821 SSH-2.0-OpenSSH
  132351 SSH-2.0-Cisco-1.25

OpenSSH is clearly the largest group, and dropbear's numbers are probably propped up by all kinds of boxes...
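To put numbers on that, here is a quick sketch that groups the top 20 banners by implementation. The grouping rule (take the software field after the protocol version and cut it at the first `_` or `-`) is my own simplification, and `family` is a hypothetical helper name.

```python
import re
from collections import Counter

# The top 20 list from above, copied as-is.
TOP20 = """\
 1730887 SSH-2.0-OpenSSH_4.3
 1562709 SSH-2.0-OpenSSH_5.3
 1067097 SSH-2.0-dropbear_0.46
  824377 SSH-2.0-dropbear_0.51
  483318 SSH-2.0-dropbear_0.52
  348878 SSH-2.0-OpenSSH_5.9p1 Debian-5ubuntu1
  327841 SSH-1.99-Cisco-1.25
  320539 SSH-2.0-OpenSSH_5.5p1 Debian-6+squeeze3
  318279 SSH-2.0-OpenSSH_5.9p1 Debian-5ubuntu1.1
  307028 SSH-2.0-ROSSSH
  271614 SSH-2.0-OpenSSH_5.5p1 Debian-6+squeeze2
  233842 SSH-2.0-OpenSSH_5.1p1 Debian-5
  225095 SSH-2.0-OpenSSH_5.1
  224991 SSH-2.0-OpenSSH_5.1p1 FreeBSD-20080901
  213201 SSH-2.0-OpenSSH_4.7
  209023 SSH-2.0-OpenSSH_6.0p1 Debian-4
  195977 SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu7
  140809 SSH-2.0-dropbear_0.50
  135821 SSH-2.0-OpenSSH
  132351 SSH-2.0-Cisco-1.25
"""


def family(banner):
    """Reduce a banner such as 'SSH-2.0-OpenSSH_5.3p1 Debian-3' to 'OpenSSH'."""
    software = banner.split("-", 2)[2].split()[0]  # drop 'SSH-<version>-' and comments
    return re.split(r"[_-]", software)[0]


totals = Counter()
for line in TOP20.splitlines():
    count, banner = line.strip().split(None, 1)
    totals[family(banner)] += int(count)

for name, count in totals.most_common():
    print(f"{count:>8} {name}")
```

Within the top 20, OpenSSH adds up to 5,990,856 hosts and dropbear to 2,515,601, with Cisco and ROSSSH far behind.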

According to the Wikipedia entry for OpenSSH, the release dates of the OpenSSH versions are:

  • 3.5:2002/10/14 (FreeBSD 4.11)
  • 4.3:2006/02/01
  • 4.7:2007/09/04
  • 5.1:2008/07/21
  • 5.3:2009/10/01
  • 5.4:2010/03/08 (FreeBSD 8.3)
  • 5.5:2010/04/16 (Debian 6)
  • 5.7:2011/01/24
  • 5.8:2011/02/04 (FreeBSD 9.1)
  • 5.9:2011/09/06
  • 6.0:2012/04/22 (Debian 7)
  • 6.1:2012/08/29 (FreeBSD 8.4)

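The operating systems in parentheses are just what I see on the machines I have at hand. But how come 4.3 is the most common... The author explains it this way: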

Note that these counts are a bit off. Some networks have a router that forwards all connections of a certain port to a single machine. Maybe "OpenSSH_4.3" is most popular banner, or maybe the national ISP of Elbonia just reroutes all port 22 requests.

So it might just be an illusion? XD

GPU-accelerated computation on PostgreSQL...

Came across PG-Strom, a hack of an extension for PostgreSQL that offloads work the CPU would normally do onto the GPU for acceleration...

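The example is odd, though. Isn't this something an R-tree index solves? PostgreSQL clearly supports R-tree indexes, so why design it this way and then use a table scan? I'll go back and think about it some more...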