用 TSV 而不用 CSV

最近常常需要提供資料給其他部門 (非技術類的部門),有時候需要提供一些表格類的資料,傳統大多數人比較熟的是產生 CSV 格式的資料讓使用者可以用 Excel 打開,但這個格式其實有很多問題,最常見的就是 encoding 與資料有逗號 comma 的問題。

如果是在 Python 下,其中一個解法是用 openpyxl 直接產生 .xlsx,但用起來還是沒那麼有下面提到的方法順手。

如果是 shell script 時就比較麻煩了,像我這次手上有一堆影片檔,要用 FFmpeg 確認每個影片的 resolution 與 framerate 再提供給同事,這時候如果還是想產生 .xlsx 就累了...

下面要提到的解法好像記得是在 K 社的時候同事教的,用 TSV 檔格式 (當然檔名要取 .tsv),然後 encoding 用 UTF-16 (LE) 就可以解決上面提到的兩個問題,產生出來的檔案可以讓 Excel 直接打開。

StackOverflow 上的「Is it possible to force Excel recognize UTF-8 CSV files automatically?」這邊翻一翻,會發現裡面提到比較好的解法其實都是產生 TSV。

這邊另外推薦,就算是寫程式,也還是可以先產生出 UTF-8 的版本 (通常副檔名我都會先取 .txt),然後用 iconv 或是 piconv 轉成 UTF-16 (LE):

iconv -f utf8 -t utf16le a.txt > a.tsv

包到 Makefile 裡面用起來其實還蠻順手的...

超快速的 Base64 encoding/decoding 實做

看到「Base64 encoding and decoding at almost the speed of a memory copy」這個,可以超級快速編解碼 Base64 的資料。

實做上是透過 IntelAVX-512 加速,在資料夠大的情況下 (超過 L1 cache 的大小),可以達到接近字串複製的速度 (這邊提到的 memcpy()):

We show how we can encode and decode base64 data at nearly the speed of a memory copy (memcpy) on recent Intel processors, as long as the data does not fit in the first-level (L1) cache. We use the SIMD (Single Instruction Multiple Data) instruction set AVX-512 available on commodity processors. Our implementation generates several times fewer instructions than previous SIMD-accelerated base64 codecs.

不過這樣 AMD 暫時要哭哭...

Twitch 用 VP9 直播...

Twitch 整理了一篇「How VP9 delivers value for Twitch’s esports live streaming」,說明他們用 VP9 的經驗談。

裡面有很大的篇幅是在講 VP9 與 H.264 的比較,不過這兩個用的技術就已經不是同一個年代了,沒有進步的話就不用出來玩了...

裡面有講到一些有趣的東西,像是提到是用 FPGA 即時壓縮:

In this article, we will show that the FPGA-based real-time VP9 encoding can deliver at least 25% bitrate savings compared to the highest-quality H.264 encoders deployed in Twitch’s production today.

然後提到 1080p60 至少省了 25% 的頻寬 (這邊應該是相較於 H.264):

VP9’s Compression Efficiency for Live 1080p60 Encoding: We Can Achieve At Least 25% Bitrate Savings

查了一下,在桌機上的瀏覽器都差不多支援了:

VP9 is implemented in these web browsers:

Chromium and Google Chrome (usable by default since version 29 from May and August 2013, respectively),
Opera (since version 15 from July 2013),
Mozilla Firefox (since version 28 from March 2014),
Microsoft Edge (as of summer 2016).

行動裝置的話 Android 4.4+ 有支援,但在 iOS 上沒有支援...

整體看起來普及率算是不低,可以引入當主力 codec 降低頻寬成本,當設備不支援 VP9 時 (應該只有 iOS 透過 Safari 觀看的情況) 就用 H.264 stream 提供服務。

Chrome 72+ (目前在 Beta 的版本) 對延伸套件的影響

我的桌機平常都是跑 beta channel 的 Chrome,所以前陣子已經升級到 72,然後發現兩個問題:

從 dev console 可以看到問題都是 Referer 沒被送出,導致有檢查 Referer 的網站拒絕存取,後來在「chrome.webRequest」這邊發現是 Chrome 72+ 改了規則,裡面提到了:

Starting from Chrome 72, the following request headers are not provided and cannot be modified or removed without specifying 'extraHeaders' in opt_extraInfoSpec:

Accept-Language
Accept-Encoding
Referer
Cookie

Starting from Chrome 72, the Set-Cookie response header is not provided and cannot be modified or removed without specifying 'extraHeaders' in opt_extraInfoSpec.

不確定這個設計的目的是什麼,但反正已經中獎了,總是得回報給各 extension 的維護者讓他們修正 (用 beta channel 的重要任務之一?)。

所以就先去 Tampermonkey 開 ticket,也很迅速的在 beta 版修正了 (所以只能先改裝 beta 版):「GM_xmlhttpRequest unable to set referer in Chrome 72+ #629」。

另外跟 Referer Control 的開發者回報這個問題,也修正出新版上架了,更新就生效了:「refererControlDisqus.html」。

目前看起來還有「Spoofs Lang」得修,不過這個軟體好久沒更新了,不知道有沒有機會...

Amazon API Gateway 支援壓縮了...

Amazon API Gateway 支援壓縮了:「Amazon API Gateway Supports Content Encoding for API Responses」。

You can now enable content encoding support for API Responses in Amazon API Gateway. Content encoding allows API clients to request content to be compressed before being sent back in the response to an API request. This reduces the amount of data that is sent from API Gateway to API clients and decreases the time it takes to transfer the data. You can enable content encoding in the API definition. You can also set the minimum response size that triggers compression. By default, APIs do not have content encoding support enabled.

打開後傳回的資料就會自動壓縮了,然後還可以設定觸發的 response size... 依照文件 (Content Codings Supported by API Gateway),目前支援的壓縮格式應該是最常見的 gzipdeflate

這功能好像是一開始有 API Gateway 就一直被提出來的 feature request...

AWS Media Services 推出一卡車與影音相關的服務...

AWS 推出了一連串 AWS Elemental MediaOOXX 一連串影音相關的服務:「AWS Media Services – Process, Store, and Monetize Cloud-Based Video」。

但不是所有的服務都是相同的區域... 公告分別在:

不過這邊還是引用 Jeff Barr 文章裡的說明,可以看到從很源頭的 transencoding 到 DRM,以及 Live 格式,到後續的檔案儲存及後製 (像是上廣告) 都有:

AWS Elemental MediaConvert – File-based transcoding for OTT, broadcast, or archiving, with support for a long list of formats and codecs. Features include multi-channel audio, graphic overlays, closed captioning, and several DRM options.

AWS Elemental MediaLive – Live encoding to deliver video streams in real time to both televisions and multiscreen devices. Allows you to deploy highly reliable live channels in minutes, with full control over encoding parameters. It supports ad insertion, multi-channel audio, graphic overlays, and closed captioning.

AWS Elemental MediaPackage – Video origination and just-in-time packaging. Starting from a single input, produces output for multiple devices representing a long list of current and legacy formats. Supports multiple monetization models, time-shifted live streaming, ad insertion, DRM, and blackout management.

AWS Elemental MediaStore – Media-optimized storage that enables high performance and low latency applications such as live streaming, while taking advantage of the scale and durability of Amazon Simple Storage Service (S3).

AWS Elemental MediaTailor – Monetization service that supports ad serving and server-side ad insertion, a broad range of devices, transcoding, and accurate reporting of server-side and client-side ad insertion.

引個前同事的 tweet,先不說 Amazon SWF 的情況 (畢竟 Amazon SWF 還可以找到其他用途),倒是 Amazon Elastic Transcoder 很明顯要被淘汰掉了:

這種整個大包的東西是 AWS re:Invent 才有的能量,平常比較少看到...

curl 將支援 Brotli 壓縮

Twitter 上看到有人提到 curl 支援 Brotli 了:「HTTP: implement Brotli content encoding」。

Brotli 對文字系列的資料比較有幫助 (像是 html):

Unlike most general purpose compression algorithms, Brotli uses a pre-defined 120 kilobyte dictionary, in addition to the dynamically populated ("sliding window") dictionary. The pre-defined dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents. Using a pre-defined dictionary has been shown to increase compression where a file mostly contains commonly-used words.

現在還在 master 裡面,之後的 release 版本應該就會支援了...

Branchless UTF-8 解碼器

看到「A Branchless UTF-8 Decoder」這篇,先來回憶一下「非常經典的 UTF-8...」這篇,以及裡面提到的 encoding:

因為當初在設計 UTF-8 時就有考慮到,所以 decoding 很容易用 DFA 解決,也就是寫成一堆 if-then-else 的條件。但現代 CPU 因為 out-of-order execution 以及 pipeline 的設計,遇到 random branch 會有很高的效能損失,所以作者就想要試著寫看看 branchless 的版本。

成效其實還好,尤其是 Clang 上說不定在誤差內:

With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.

而後來的更新則是大幅改善,在 Clang 上 DFA 版本比 branchless 的快:

Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.

所以作者最後也有說這是個嘗試而已 XD:

It’s just a different approach. In practice I’d prefer Björn’s DFA decoder.

U2F Security Key 產品測試?

Adam Langley 的「Testing Security Keys」這篇測試了不少有支援 U2F Security Key 的產品,這邊作者是以 Linux 環境測試。

tl;dr:在 Linux 環境下,除了 Yubico 的產品沒問題外,其他的都有問題... (只是差在問題多與少而已)

Yubico 的沒找到問題:

Easy one first: I can find no flaws in Yubico's U2F Security Key.

VASCO SecureClick 的則是 vendor ID 與 product ID 會跑掉:

If you're using Linux and you configure udev to grant access to the vendor ID & product ID of the token as it appears normally, nothing will work because the vendor ID and product ID are different when it's active. The Chrome extension will get very confused about this.

Feitian ePass 的 ASN.1 DER 編碼是錯的:

ASN.1 DER is designed to be a “distinguished” encoding, i.e. there should be a unique serialisation for a given value and all other representations are invalid. As such, numbers are supposed to be encoded minimally, with no leading zeros (unless necessary to make a number positive). Feitian doesn't get that right with this security key: numbers that start with 9 leading zero bits have an invalid zero byte at the beginning. Presumably, numbers starting with 17 zero bits have two invalid zero bytes at the beginning and so on, but I wasn't able to press the button enough times to get such an example. Thus something like one in 256 signatures produced by this security key are invalid.

Thetis 根本沒照 spec 走,然後也有相同的 ASN.1 DER 編碼問題:

With this device, I can't test things like key handle mutability and whether the appID is being checked because of some odd behaviour. The response to the first Check is invalid, according to the spec: it returns status 0x9000 (“NO_ERROR”), when it should be 0x6985 or 0x6a80. After that, it starts rejecting all key handles (even valid ones) with 0x6a80 until it's unplugged and reinserted.

This device has the same non-minimal signature encoding issue as the Feitian ePass. Also, if you click too fast, this security key gets upset and rejects a few requests with status 0x6ffe.

U2F Zero 直接 crash 沒辦法測 XDDD:

A 1KiB ping message crashes this device (i.e. it stops responding to USB messages and needs to be unplugged and reinserted). Testing a corrupted key handle also crashes it and thus I wasn't able to run many tests.

KEY-ID (網站連 HTTPS 都沒上...) / HyperFIDO 也有編碼問題但更嚴重:

The Key-ID (and HyperFIDO devices, which have the same firmware, I think) have the same non-minimal encoding issue as the Feitian ePass, but also have a second ASN.1 flaw. In ASN.1 DER, if the most-significant bit of a number is set, that number is negative. If it's not supposed to be negative, then a zero pad byte is needed. I think what happened here is that, when testing the most-significant bit, the security key checks whether the first byte is > 0x80, but it should be checking whether it's >= 0x80. The upshot is the sometimes it produces signatures that contain negative numbers and are thus invalid.

所以還是乖乖用 GitHub 帳號買 Yubico 吧...

Mozilla 推出 mozjpeg 2.0

othree 前天已經寫過:「mozjpeg 2.0」,不過因為這類性的研究其實對全世界幫助頗大,所以就再提一次...

原文在「Mozilla Advances JPEG Encoding with mozjpeg 2.0」這邊,主要的成果:

With today’s release, mozjpeg 2.0 can reduce file sizes for both baseline and progressive JPEGs by 5% on average compared to those produced by libjpeg-turbo, the standard JPEG library upon which mozjpeg is based [1]. Many images will see further reductions.

文章內也出現了一些關鍵字:

We’ve added options to specifically tune for PSNR, PSNR-HVS-M, SSIM, and MS-SSIM metrics.

PSNR 是最常聽到的,其他幾個 keyword 剛好可以拿來當 entry point。在「Video quality」這邊的 See also 部份也有不少 keyword 可以查...