Amazon Transcribe 可以吃其他格式了

Amazon TranscribeAWS 推出語音轉文字的服務,先前只有提供 WAVFLACMP3MP4 格式,現在則是多支援不少格式:

Today, we are excited to announce native support for media files in AMR, AMR-WB, Ogg and WebM format by Amazon Transcribe.

AMRAMR-WB 以前還蠻常看到的,最近比較少看到了,可能是專利加上選擇性多之後用的人就變少了。

再來是 OggWebM 兩個都是開放格式。

上次拿 Amazon Transcribe 測日文的影片,先用 FFmpeg 把 MP4 檔內的 audio track 抽出來再丟上去轉,轉完後用 andyhopp/aws-transcribe-to-srt 把 Amazon Transcribe 輸出的 JSON 再轉成 SRT 檔,就辨識正確度測起來算是堪用,但專有名詞 (像是人名) 就得另外處理,不過比什麼都沒有好不少...

AWS Ground Station 多了南非區可以打

看到「AWS Ground Station is now available in the Africa (Cape Town) Region」這邊的消息,AWS Ground Station 可以在南非區使用。

這是南半球第二個點 (第一個是雪梨):

AWS Ground Station is available today in US West (Oregon), US East (Ohio), Middle East (Bahrain), EU (Stockholm), Asia Pacific (Sydney), EU (Ireland), and Africa (Cape Town) with more regions coming soon.

如果以補 coverage 的角度來看,南美與東北亞 (東京或是南韓) 應該也有機會?

Let's Encrypt 生了新的 Root 與 Intermediate Certificate

Let's Encrypt 弄了新的 Root Certificate 與 Intermediate Certificate:「Let's Encrypt's New Root and Intermediate Certificates」。

一方面是本來的 Intermediate Certificate 也快要要過期了,另外一方面是要利用 ECDSA 降低傳輸時的頻寬成本:

On Thursday, September 3rd, 2020, Let’s Encrypt issued six new certificates: one root, four intermediates, and one cross-sign. These new certificates are part of our larger plan to improve privacy on the web, by making ECDSA end-entity certificates widely available, and by making certificates smaller.

本來有 Let's Encrypt Authority {X1,X2,X3,X4} 四組 Intermediate Certificate,都是 RSA 2048 bits。

其中 X1 與 X2 差不多都到期了 (cross-signed 的已經過了,自家 ISRG Root X1 簽的剩不到一個月),不過這兩組已經沒在用了,這次就不管他了。

而 X3 與 X4 這兩組則是明年到期,會產生出新的 Intermediate Certificate,會叫做 R3 與 R4,跟之前一樣會被自家 ISRG Root X1 簽,以及 IdenTrust DST Root CA X3 簽:

For starters, we’ve issued two new 2048-bit RSA intermediates which we’re calling R3 and R4. These are both issued by ISRG Root X1, and have 5-year lifetimes. They will also be cross-signed by IdenTrust. They’re basically direct replacements for our current X3 and X4, which are expiring in a year. We expect to switch our primary issuance pipeline to use R3 later this year, which won’t have any real effect on issuance or renewal.

然後是本次的重頭戲,會弄出一個新的 Root Certificate,叫做 ISRG Root X2,以及兩個 Intermediate Certificate,叫做 E1 與 E2:

The other new certificates are more interesting. First up, we have the new ISRG Root X2, which has an ECDSA P-384 key instead of RSA, and is valid until 2040. Issued from that, we have two new intermediates, E1 and E2, which are both also ECDSA and are valid for 5 years.

主要的目的就是降低 TLS 連線時的 bandwidth,這次的設計預期可以降低將近 400 bytes:

While a 2048-bit RSA public key is about 256 bytes long, an ECDSA P-384 public key is only about 48 bytes. Similarly, the RSA signature will be another 256 bytes, while the ECDSA signature will only be 96 bytes. Factoring in some additional overhead, that’s a savings of nearly 400 bytes per certificate. Multiply that by how many certificates are in your chain, and how many connections you get in a day, and the bandwidth savings add up fast.

另外一個特別的修改是把名字改短 (把「Let's Encrypt Authority」拿掉),也是為了省傳輸的成本:

As an aside: since we’re concerned about certificate sizes, we’ve also taken a few other measures to save bytes in our new certificates. We’ve shortened their Subject Common Names from “Let’s Encrypt Authority X3” to just “R3”, relying on the previously-redundant Organization Name field to supply the words “Let’s Encrypt”. We’ve shortened their Authority Information Access Issuer and CRL Distribution Point URLs, and we’ve dropped their CPS and OCSP urls entirely. All of this adds up to another approximately 120 bytes of savings without making any substantive change to the useful information in the certificate.

這個部份讓我想到之前寫的「省頻寬的方法:終極版本...」這篇,裡面提到 AWS 自家的 SSL Certificate 太胖,改用 DigiCert 的反而可以省下不少錢 XDDD

另外也提到了這次 cross-sign 的部份是對 ECDSA Root Certificate 簽 (ISRG Root X2),而不是對 ECDSA Intermediate Certificate 簽 (E1 與 E2),主因是不希望多一次切換的轉移期:

In the end, we decided that providing the option of all-ECDSA chains was more important, and so opted to go with the first option, and cross-sign the ISRG Root X2 itself.

這算是蠻重要的進展,看起來各家 client 最近應該都會推出新版支援。

抓 PDF 裡文字的問題

Hacker News Daily 上看到的,在講從 PDF 裡面拉文字出來遇到的各種問題:「What's so hard about PDF text extraction?」。

FilingDB 是一家處理歐洲公司資料的公司,可能是開公司時送件的時候要求用 PDF,或是政府單位輸出的時候用 PDF,所以他們必須從這些 PDF 裡面拉出文字分析,然後就能夠讓程式使用:

會這麼難搞的原因是因為 PDF 是設計給輸出端用,而不是語意化用的格式:

The main problem is that PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document.

每個字元 (character) 都是可以被獨立控制的物件:

At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs - or even words - but as characters which are painted at certain locations on the page.

然後文章後面都在展示各種 workaround XD

Wikimedia 弄了自己的 Mattermost

Wikimedia (維基百科後面的基金會) 又多了一個溝通工具:「Introducing Wikimedia Chat!」。

最傳統的方式是在 wiki 的 Talk 頁上溝通 (現在看起來還是有些正式的投票討論需要走這個方式),但那個界面用起來真的頗痛苦... 一般的社群討論還是會在其他工具上進行。

先前有晃進去看過的平台應該是 IRC 與 Telegram 群組,不過後來因為量太大就閃出來了,另外這邊有提到 SlackDiscordFacebook

You can now see Wikimedia-related discussion groups in Slack, Discord, Telegram, Facebook, and many more.

這些平台都還是放在外部,就會有很多隱私上的考量:

Besides being scattered and inaccessible to people who don’t have accounts in those platforms (for privacy reasons for example), these platforms use proprietary and closed-source software, are outside Wikimedia infrastructure and some harvest our personal data for profit.

freenode 上面的 IRC 算是相對起來比較開放,但還是少了不少功能,所以就自己架了 Mattermost 出來:

IRC on freenode.net is a good alternative but it lacks basic functionalities of a modern chat platform. So we created Wikimedia Chat, a Mattermost instance hosted in Wikimedia Cloud.

比較特別的是超過 90 天的記錄會被砍掉?不太懂這邊的邏輯...

As a Wikimedia Cloud project, all of discussions, private and public are covered by Code of conduct in technical spaces and due to Wikimedia Cloud privacy policy all discussions older than ninety days will be deleted.

Amazon Transcribe 可以自動偵測語言了

Amazon Transcribe 可以將聲音轉成文字,先前都需要自己指定語言,而這幾天發表新的功能,可以自動偵測語言:「Amazon Transcribe Now Supports Automatic Language Identification」。

不過系統要求最少要有 30 秒的資料,跟人類比起來還是有點差距,但比起之前好用不少:

With a minimum of 30 seconds of audio, Amazon Transcribe can efficiently generate transcripts in the spoken language without wasting time and resources on manual tagging.

沒有額外的費用,主要就是照著本來的價錢在走:

There is no additional charge on top of the existing pricing.

翻了一下價錢,好像可以來測一些東西...

CloudFront 宣佈支援 Brotli

CloudFront 宣佈支援 Brotli:「Amazon CloudFront announces support for Brotli compression」。

官方的說明發現 Gzip 可以好 24%:

CloudFront's Brotli edge compression delivers up to 24% smaller file sizes as compared to Gzip.

Akamai 在「Understanding Brotli's Potential」這邊提到的測試數字稍微做了分類,可以看到在 html 下 Brotli 帶來的改善是最多的。

以前在 CloudFront 上還是可以支援 Brotli,主要是透過後端支援 Brotli 的方式傳回不同的資料,再加上 Vary: Accept-Encoding 的設定讓 CloudFront 針對不同的 Accept-Encoding 分開 cache。

這次的支援等於是讓 CloudFront 理解 Brotli,就可以提昇 hit rate 並且降低後端的壓力:

Prior to today, you could enable Brotli compression at the origin by whitelisting the 'Accept-Encoding' header. Now CloudFront includes 'br' in the normalized 'Accept-Encoding' header before forwarding it to your origin. You no longer need to whitelist the 'Accept-Encoding' header to enable Brotli origin compression, improving your overall cache hit ratio. Additionally, if your origin sends uncompressed content to CloudFront, CloudFront can now automatically compress cacheable responses at the edge using Brotli.

算是補產品線...

AWS 推出了 ARM 平台上 T 系列的機器

前幾天發現在 AWS Web Console 上開 EC2 機器時,選 t3a 後本來可以選的「T2/T3 Unlimited」變成只叫「Unlimited」,心裡猜測有東西要推出,然後這幾天看到消息了...

這次 AWS 推出了 t4g 系列的機器,而這邊的 g 如同慣例,指的是 ARM 的 Graviton2:「New EC2 T4g Instances – Burstable Performance Powered by AWS Graviton2 – Try Them for Free」。

目前公司在用的 ap-southeast-1 沒有在支援的地區,只好去 us-east-1 上玩:

T4g instances are available today in US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Tokyo, Mumbai), Europe (Frankfurt, Ireland).

剛好這兩天把 SOP 文件的安裝方法改成 ansible playbook,就順便拿 t4g 的機器測了一下也沒什麼問題。

另外 T 系列機器最重要的 CPU credit 的部份,在官方文件「CPU credits and baseline utilization for burstable performance instances」這邊也已經可以看到 t4g 的相關資料了,基本上跟 t3t3a 是一樣的設計。

而價錢的部份,都以 T 系列裡最大的 2xlarge 來算,Intel 平台的 t3.2xlarge 是 $0.3328/hr,AMD 平台的 t3a.2xlarge 則是 $0.3008/hr,而 t4g.2xlarge 是 $0.2688/hr,大約是 80.7% 與 89.3% 的比率。

另外官方宣稱效能還比 x86 平台上好很多,這點可以打個折看,不過就價位來說是真的不錯:

Using T4g instances you can enjoy a performance benefit of up to 40% at a 20% lower cost in comparison to T3 instances, providing the best price/performance for a broader spectrum of workloads.

不過目前公司的主力還是在新加坡區,而且還有 RI 在跑,等有了 t4g 之後再把一些東西丟上去測看看,然後找時間換過去...

新的 TLS 攻擊:Raccoon Attack

這次看到的是針對 TLS 實做上的問題產生的 Raccoon Attack,反正先取個名字就對了,原圖有點大張,設個 medium size 好了 XDDD:

Why is the attack called "Raccoon"?
Raccoon is not an acronym. Raccoons are just cute animals, and it is well past time that an attack will be named after them :)

先講影響的產品,首先是經常中槍的 F5,這次連 timing measurement 都不需要太準確就可以打穿:

In particular, several F5 products allow executing a special version of the attack, without the need for precise timing measurements.

OpenSSL 的部份因為從 1.0.2f 之後因為其他的 security issue 所以改善了實做方式,就不會受到這次的攻擊手法影響。

剛剛翻了一下 Ubuntu 上的的資料,看起來 16.04 (xenial) 上的 OpenSSL 就已經是 1.0.2g 了,所以目前只要是有在 Ubuntu 支援的版本應該都不受影響:

OpenSSL assigned the issue CVE-2020-1968. OpenSSL does use fresh DH keys per default since version 1.0.2f (which made SSL_OP_SINGLE_DH_USE default as a response to CVE-2016-0701).

Firefox 直接拔了 DH 與 DHE 相關的 cipher suite,反正在這次攻擊手法出來前本來就已經計畫要拔掉:

Mozilla assigned the issue CVE-2020-12413. It has been solved by disabling DH and DHE cipher suites in Firefox (which was already planned before the Raccoon disclosure).

微軟的部份則是推更新出來:

Microsoft assigned the issue CVE-2020-1596. Please refer to the Microsoft Security Response Center portal.

回到攻擊手法,這次的問題是因為 DH 相關的實做造成的問題。

TLS 要求去掉 premaster secret 裡開頭的 0,造成會因為開頭的 0 數量不同而實做上就不會是 constant time,所以有了一些 side channel information 可以用:

Our Raccoon attack exploits a TLS specification side channel; TLS 1.2 (and all previous versions) prescribes that all leading zero bytes in the premaster secret are stripped before used in further computations. Since the resulting premaster secret is used as an input into the key derivation function, which is based on hash functions with different timing profiles, precise timing measurements may enable an attacker to construct an oracle from a TLS server.

然後一層一層堆,能夠知道 premaster secret 開頭是不是 0 之後,接下來因為 server side 會重複使用同一組 premaster secret,所以可以當作一個 oracle,試著去計算出更後面的位數:

This oracle tells the attacker whether a computed premaster secret starts with zero or not. For example, the attacker could eavesdrop ga sent by the client, resend it to the server, and determine whether the resulting premaster secret starts with zero or not.

Learning one byte from a premaster secret would not help the attacker much. However, here the attack gets interesting. Imagine the attacker intercepted a ClientKeyExchange message containing the value ga. The attacker can now construct values related to ga and send them to the server in distinct TLS handshakes. More concretely, the attacker constructs values gri*ga, which lead to premaster secrets gri*b*gab. Based on the server timing behavior, the attacker can find values leading to premaster secrets starting with zero. In the end, this helps the attacker to construct a set of equations and use a solver for the Hidden Number Problem (HNP) to compute the original premaster secret established between the client and the server.

所以針對這個攻擊手法的解法就是用「新鮮的」premaster secret (像是完全不重複使用),然後保留開頭的 0,不需要去掉。而 TLS 1.3 在定義的時候把這兩件事情都做了,所以不會受到影響:

Is TLS 1.3 also affected?
No. In TLS 1.3, the leading zero bytes are preserved for DHE cipher suites (as well as for ECDHE ones) and keys should not be reused.

另外在這邊提到的 Hidden Number Problem (HNP) 也是個不熟悉的詞彙,網站上有提到論文,也就是「Hardness of computing the most significant bits of secret keys in Diffie-Hellman and related schemes」這篇:

Given an oracle Oα(x) that on input x computes the k most significant bits of (α * gx mod p) , find α mod p.

是個離散對數類的問題,之後有空再來翻一翻好了。

libtorrent 宣佈支援 BitTorrent v2

看到 libtorrent 宣佈支援 BitTorrent v2 (BEP 52) 的消息:「BitTorrent v2」。

BitTorrent v2 這個規格丟出來好久了,但一直都是 draft,而且沒什麼人想要理他,直到 Google 成功產生出 SHA-1 collision 的時候稍微有些音量跑出來,但沒想到居然有人跳下去支援了...

對使用者比較有感覺的差異是從 SHA-1 換成 SHA-2 的 SHA-256 了,這個會影響到整個 torrent file 的結構與 Magnet URI 的部份。

另外一個比較大的改變是 torrent 檔資料結構,有兩個比較大的改變。

第一個是以前用固定的 block size 切割,然後每個 block 產生出 hash,所以 torrent 檔會隨著 block size 選擇的大小 (成反比) 檔案大小 (成正比) 有關,現在會用 Merkle tree,所以只要有 root hash 就可以了。

第二個是以前是把所有檔案包在一起 hash,現在是個別檔案都有自己的 hash (改成 root hash),所以現在變成可以跨 torrent 檔共用檔案。

然後 libtorrent 的文章裡有提到向前相容的方法,不過以產品面上來說沒有什麼太大的誘因,libtorrent 雖然大,但其他幾家的支援度應該也是重點...