Amazon Transcribe Now Accepts More Formats

Amazon Transcribe is AWS's speech-to-text service. It previously only supported WAV, FLAC, MP3, and MP4, and now it supports quite a few more formats:

Today, we are excited to announce native support for media files in AMR, AMR-WB, Ogg and WebM format by Amazon Transcribe.

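AMR and AMR-WB used to be fairly common, but I see them less often these days, probably because of patents, and because people have more choices now.

As for Ogg and WebM, both are open formats.

Last time I tested Amazon Transcribe on a Japanese video, I first used FFmpeg to extract the audio track from the MP4 file and uploaded that for transcription, then used andyhopp/aws-transcribe-to-srt to convert the JSON that Amazon Transcribe outputs into an SRT file. The recognition accuracy was usable, but proper nouns (such as personal names) had to be handled separately. Still, it is much better than having nothing...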
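To make the JSON-to-SRT step concrete, here is a minimal sketch of such a conversion. It is not how andyhopp/aws-transcribe-to-srt works internally; it only assumes the usual shape of Transcribe output, where `results.items` holds `pronunciation` items with `start_time`/`end_time` and `punctuation` items without timestamps. The `max_words` grouping rule and the function names are my own assumptions for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def transcribe_json_to_srt(doc: dict, max_words: int = 8) -> str:
    """Turn a parsed Transcribe result into SRT text.

    Assumption: a new subtitle cue starts after every `max_words` words.
    Real tools usually split on pauses or sentence ends instead.
    """
    cues = []
    current = None
    for item in doc["results"]["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation has no timestamps; glue it onto the open cue.
            if current is not None:
                current["text"] += content
            continue
        start, end = float(item["start_time"]), float(item["end_time"])
        if current is not None and current["words"] >= max_words:
            cues.append(current)
            current = None
        if current is None:
            current = {"start": start, "end": end, "text": content, "words": 1}
        else:
            current["end"] = end
            current["text"] += " " + content
            current["words"] += 1
    if current is not None:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        span = f"{srt_timestamp(cue['start'])} --> {srt_timestamp(cue['end'])}"
        blocks.append(f"{index}\n{span}\n{cue['text']}\n")
    return "\n".join(blocks)
```

Joining words with a space is fine for English; for Japanese output you would probably join without spaces, which is one more reason to treat this as a sketch.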

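AWS Ground Station Adds a South Africa Region

I saw the news in "AWS Ground Station is now available in the Africa (Cape Town) Region": AWS Ground Station can now be used in the South Africa region.

This is the second site in the Southern Hemisphere (the first is Sydney):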

AWS Ground Station is available today in US West (Oregon), US East (Ohio), Middle East (Bahrain), EU (Stockholm), Asia Pacific (Sydney), EU (Ireland), and Africa (Cape Town) with more regions coming soon.

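From the perspective of filling in coverage, South America and Northeast Asia (Tokyo or South Korea) probably have a chance too?

B-Tree Deduplication in PostgreSQL 13

On Hacker News I saw the article "Lessons Learned from Running Postgres 13: Better Performance, Monitoring & More", which mentions that because B-Tree indexes in PostgreSQL 13 support deduplication, there is a chance to save quite a bit of space.

I searched a bit, and the origin is the git commit "Add deduplication to nbtree."; PostgreSQL's official explanation is in "63.4.2. Deduplication".

Also worth mentioning: the CREATE INDEX page shows that this feature is turned on by default in PostgreSQL 13.

According to the explanation, the original mechanism looks like this: when keys in a B-Tree index are identical, for example key1 = key2 = key3, it stores {key1, ptr1}, {key2, ptr2}, {key3, ptr3}.

Under the new design, with deduplication enabled, this becomes a structure like {key1, [ptr1, ptr2, ptr3]}. You can see that when many rows share the same key, this saves a lot of space (in technical terms, when cardinality is low).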
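The difference between the two layouts can be sketched in a few lines. This is a toy model, not PostgreSQL's actual on-disk format (real posting lists have size limits and other details that are ignored here), and the function names are made up for illustration.

```python
from itertools import groupby


def plain_entries(pairs):
    """Pre-deduplication layout: one (key, ptr) entry per row, sorted by key."""
    return sorted(pairs)


def deduplicated_entries(pairs):
    """Deduplicated layout: one entry per distinct key with a list of pointers."""
    return [
        (key, [ptr for _, ptr in group])
        for key, group in groupby(sorted(pairs), key=lambda pair: pair[0])
    ]


def stored_key_count(entries):
    """Count how many copies of a key the layout has to store."""
    return len(entries)
```

With rows `[("tw", 1), ("jp", 2), ("tw", 3), ("tw", 4)]`, the plain layout stores four keys while the deduplicated one stores two, and the gap grows as more rows share a key.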

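It looks like this can relieve quite a bit of pressure...

Let's Encrypt Created New Root and Intermediate Certificates

Let's Encrypt has created new root and intermediate certificates: "Let's Encrypt's New Root and Intermediate Certificates".

On one hand, the existing intermediate certificates are about to expire; on the other hand, they want to use ECDSA to reduce the bandwidth cost of transmission: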

On Thursday, September 3rd, 2020, Let’s Encrypt issued six new certificates: one root, four intermediates, and one cross-sign. These new certificates are part of our larger plan to improve privacy on the web, by making ECDSA end-entity certificates widely available, and by making certificates smaller.

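Previously there were four intermediate certificates, Let's Encrypt Authority {X1,X2,X3,X4}, all RSA 2048 bits.

Of these, X1 and X2 are nearly expired (the cross-signed ones already have, and the ones signed by their own ISRG Root X1 have less than a month left), but these two are no longer in use, so they are left alone this time.

X3 and X4 expire next year, so new intermediate certificates called R3 and R4 are being created. As before, they are signed by their own ISRG Root X1 and also by IdenTrust DST Root CA X3: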

For starters, we’ve issued two new 2048-bit RSA intermediates which we’re calling R3 and R4. These are both issued by ISRG Root X1, and have 5-year lifetimes. They will also be cross-signed by IdenTrust. They’re basically direct replacements for our current X3 and X4, which are expiring in a year. We expect to switch our primary issuance pipeline to use R3 later this year, which won’t have any real effect on issuance or renewal.

Then comes the main event: a new root certificate called ISRG Root X2, and two intermediate certificates called E1 and E2:

The other new certificates are more interesting. First up, we have the new ISRG Root X2, which has an ECDSA P-384 key instead of RSA, and is valid until 2040. Issued from that, we have two new intermediates, E1 and E2, which are both also ECDSA and are valid for 5 years.

The main goal is to reduce the bandwidth used by TLS connections; this design is expected to save nearly 400 bytes per certificate:

While a 2048-bit RSA public key is about 256 bytes long, an ECDSA P-384 public key is only about 48 bytes. Similarly, the RSA signature will be another 256 bytes, while the ECDSA signature will only be 96 bytes. Factoring in some additional overhead, that’s a savings of nearly 400 bytes per certificate. Multiply that by how many certificates are in your chain, and how many connections you get in a day, and the bandwidth savings add up fast.
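The arithmetic in the quote can be checked quickly. The four size constants below are the approximate figures from the quote itself; the chain length and the daily connection count in the usage note are hypothetical numbers of mine, only there to show how the savings scale.

```python
# Approximate sizes in bytes, as stated in the Let's Encrypt post.
RSA_2048_PUBLIC_KEY = 256
ECDSA_P384_PUBLIC_KEY = 48
RSA_2048_SIGNATURE = 256
ECDSA_P384_SIGNATURE = 96


def savings_per_certificate() -> int:
    """Bytes saved per certificate from the key and signature alone."""
    key_savings = RSA_2048_PUBLIC_KEY - ECDSA_P384_PUBLIC_KEY
    signature_savings = RSA_2048_SIGNATURE - ECDSA_P384_SIGNATURE
    return key_savings + signature_savings


def daily_savings(certs_in_chain: int, connections_per_day: int) -> int:
    """Bytes saved per day, ignoring the extra overhead the post mentions."""
    return savings_per_certificate() * certs_in_chain * connections_per_day
```

The key and signature account for 208 + 160 = 368 bytes; the "nearly 400" in the quote includes additional overhead. For a hypothetical chain of 2 certificates and 1,000,000 connections a day, `daily_savings(2, 1_000_000)` comes to 736,000,000 bytes.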

Another notable change is shortening the names (dropping "Let's Encrypt Authority"), also to save on transmission cost:

As an aside: since we’re concerned about certificate sizes, we’ve also taken a few other measures to save bytes in our new certificates. We’ve shortened their Subject Common Names from “Let’s Encrypt Authority X3” to just “R3”, relying on the previously-redundant Organization Name field to supply the words “Let’s Encrypt”. We’ve shortened their Authority Information Access Issuer and CRL Distribution Point URLs, and we’ve dropped their CPS and OCSP urls entirely. All of this adds up to another approximately 120 bytes of savings without making any substantive change to the useful information in the certificate.

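This part reminds me of my earlier post "省頻寬的方法:終極版本..." (roughly "Ways to save bandwidth: the ultimate version..."), which mentioned that AWS's own SSL certificates were too fat, and that switching to DigiCert's could actually save quite a bit of money XDDD

The post also mentions that the cross-sign this time is applied to the ECDSA root certificate (ISRG Root X2) rather than to the ECDSA intermediate certificates (E1 and E2), mainly because they do not want an extra transition period: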

In the end, we decided that providing the option of all-ECDSA chains was more important, and so opted to go with the first option, and cross-sign the ISRG Root X2 itself.

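This is a fairly important development; it looks like clients should be shipping new versions with support soon.

DuckDB

I came across an interesting introduction to DuckDB: "DuckDB".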

[I]t uses the PostgreSQL parser but models itself after SQLite in that databases are a single file and the code is designed for use as an embedded library, distributed in a single amalgamation C++ file (SQLite uses a C amalgamation).

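It looks like a database designed around OLAP, and in Python it can be installed directly through pip.

It looks like something that goes for throughput on a single machine while offering an interface everyone is familiar with.

On Hacker News, "DuckDB – An embeddable SQL database like SQLite, but supports Postgres features (duckdb.org)" offers quite a few pointers.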

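Raspberry Pi 4 Can Now Boot from USB

Again seen on Hacker News Daily: "Boot from USB · Issue #28 · raspberrypi/rpi-eeprom". The discussion on GitHub does not have much substance, though; the main thing is the Hacker News discussion: "Raspberry Pi 4 can finally boot directly from USB (github.com)".

This feature worked on earlier Raspberry Pi models, but probably because the USB module in the RPi4 was completely replaced, it was not available when the RPi4 launched; support only arrived with a recent firmware update.

Even though USB 2.0 boot was possible before, USB 2.0 tops out at 480Mbps, and after subtracting the protocol overhead of USB mass storage it was not much faster than an SD card, so people did not care much about it. Now that USB 3.0 is much faster, there is a lot more room to work with. I will find a chance to hook up an external drive and test a few things...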
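For a rough sense of the ceiling involved, the 480Mbps figure converts to bytes as follows. This is only the raw signaling rate; the actual throughput after USB mass storage overhead is lower, and the sketch does not attempt to estimate that overhead.

```python
def raw_megabytes_per_second(megabits_per_second: float) -> float:
    """Convert a raw link rate in Mbps to MB/s (8 bits per byte)."""
    return megabits_per_second / 8


# USB 2.0 high speed, the figure mentioned above.
USB2_MBPS = 480
```

`raw_megabytes_per_second(USB2_MBPS)` gives 60.0, so anything on a USB 2.0 port is capped below 60 MB/s before overhead is even counted.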

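By the way, in the past we got this kind of thing through group buys among coworkers, but I later found that PChome 24h sells it too, and the price is reasonable? (Plus it arrives the next day?)

The Problems with Extracting Text from PDFs

Seen on Hacker News Daily, about the various problems encountered when pulling text out of PDFs: "What's so hard about PDF text extraction?".

FilingDB is a company that processes data on European companies. Perhaps filings must be submitted as PDF when registering a company, or government agencies publish their output as PDF, so they have to pull the text out of these PDFs for analysis before programs can use it.

The reason it is so hard to deal with is that PDF was designed as an output format, not as a semantic one: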

The main problem is that PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document.

Every character is an object that can be controlled independently:

At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs - or even words - but as characters which are painted at certain locations on the page.
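To illustrate what "characters painted at certain locations" means for extraction, here is a minimal sketch that rebuilds lines of text from positioned glyphs. It is not FilingDB's method; the `Glyph` type and the `y_tolerance` and `gap_ratio` thresholds are assumptions chosen for the example, and real PDFs break these heuristics in many ways.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF coordinates grow upward
    width: float


def glyphs_to_lines(glyphs, y_tolerance=2.0, gap_ratio=0.3):
    """Group glyphs into lines by baseline, then order each line left to right.

    A space is inserted when the horizontal gap between two glyphs exceeds
    gap_ratio times the previous glyph's width (a guessed heuristic).
    """
    rows = []  # each row is (baseline_y, [glyphs on that baseline])
    for glyph in sorted(glyphs, key=lambda g: -g.y):
        if rows and abs(rows[-1][0] - glyph.y) <= y_tolerance:
            rows[-1][1].append(glyph)
        else:
            rows.append((glyph.y, [glyph]))

    lines = []
    for _, row in rows:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        lines.append(text)
    return lines
```

Even this small sketch has to guess where words and lines begin, which is exactly the information the format never stored.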

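The rest of the article demonstrates various workarounds XD

Wikimedia Set Up Its Own Mattermost

Wikimedia (the foundation behind Wikipedia) has added another communication tool: "Introducing Wikimedia Chat!".

The most traditional way is to communicate on wiki Talk pages (it looks like some formal votes and discussions still need to go this way), but that interface is really quite painful to use... General community discussion still happens on other tools.

The platforms I had wandered into before were IRC and Telegram groups, though I later left because the volume was too high. The post also mentions Slack, Discord, and Facebook: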

You can now see Wikimedia-related discussion groups in Slack, Discord, Telegram, Facebook, and many more.

These platforms are all hosted externally, so there are many privacy concerns:

Besides being scattered and inaccessible to people who don’t have accounts in those platforms (for privacy reasons for example), these platforms use proprietary and closed-source software, are outside Wikimedia infrastructure and some harvest our personal data for profit.

IRC on freenode is relatively open, but it still lacks quite a few features, so they set up their own Mattermost:

IRC on freenode.net is a good alternative but it lacks basic functionalities of a modern chat platform. So we created Wikimedia Chat, a Mattermost instance hosted in Wikimedia Cloud.

What is unusual is that logs older than 90 days get deleted? I do not quite understand the logic here...

As a Wikimedia Cloud project, all of discussions, private and public are covered by Code of conduct in technical spaces and due to Wikimedia Cloud privacy policy all discussions older than ninety days will be deleted.

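Amazon Transcribe Can Now Detect Language Automatically

Amazon Transcribe converts audio to text. Previously you had to specify the language yourself; a new feature announced in the past few days detects it automatically: "Amazon Transcribe Now Supports Automatic Language Identification".

The system requires at least 30 seconds of audio, which is still some way off from what a human can do, but it is much more convenient than before: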

With a minimum of 30 seconds of audio, Amazon Transcribe can efficiently generate transcripts in the spoken language without wasting time and resources on manual tagging.

There is no extra charge; it simply follows the existing pricing:

There is no additional charge on top of the existing pricing.

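I looked at the pricing, and it seems I could test a few things...

CloudFront Announces Support for Brotli

CloudFront has announced support for Brotli: "Amazon CloudFront announces support for Brotli compression".

According to the official announcement, files can be up to 24% smaller than with Gzip: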

CloudFront's Brotli edge compression delivers up to 24% smaller file sizes as compared to Gzip.

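The test numbers Akamai gives in "Understanding Brotli's Potential" are broken down by category, and they show that Brotli brings the largest improvement for HTML.

Before this, Brotli could still be used on CloudFront, mainly by having a Brotli-capable origin return different content, combined with a Vary: Accept-Encoding setup so that CloudFront caches separately for each Accept-Encoding value.

This support means CloudFront now understands Brotli, which raises the hit rate and reduces load on the origin: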

Prior to today, you could enable Brotli compression at the origin by whitelisting the 'Accept-Encoding' header. Now CloudFront includes 'br' in the normalized 'Accept-Encoding' header before forwarding it to your origin. You no longer need to whitelist the 'Accept-Encoding' header to enable Brotli origin compression, improving your overall cache hit ratio. Additionally, if your origin sends uncompressed content to CloudFront, CloudFront can now automatically compress cacheable responses at the edge using Brotli.
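Why normalizing the header helps the hit ratio can be shown with a small sketch. This is an illustration of the idea only, not CloudFront's implementation: the function name, the exact normalized strings, and the choice to ignore q-values are all assumptions.

```python
def normalize_accept_encoding(header: str, supported=("br", "gzip")) -> str:
    """Reduce an Accept-Encoding header to the encodings the cache cares about.

    Assumption: q-values are ignored and the result is always emitted in the
    fixed order given by `supported`, so equivalent headers map to one key.
    """
    offered = set()
    for part in header.split(","):
        token = part.split(";", 1)[0].strip().lower()
        if token:
            offered.add(token)
    return ",".join(enc for enc in supported if enc in offered)


def distinct_cache_keys(headers, normalize: bool) -> int:
    """Count cache variants for one URL given clients' Accept-Encoding headers."""
    keys = {normalize_accept_encoding(h) if normalize else h for h in headers}
    return len(keys)
```

When the raw header is forwarded as part of the cache key, "gzip, deflate, br", "br, gzip", and "deflate, gzip, br" produce three separate cache entries for the same object; after normalization all three collapse to "br,gzip" and share one entry.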

This more or less fills out the product line...