AWS Media Services: a whole truckload of video-related services...

AWS has launched a whole series of "AWS Elemental Media-something" video services: "AWS Media Services – Process, Store, and Monetize Cloud-Based Video".

Not every service is available in the same set of regions, though... The individual announcements are:

Quoting the descriptions from Jeff Barr's post, you can see the lineup covers everything from transcoding at the very source to DRM and live formats, through to file storage and post-processing (such as ad insertion):

AWS Elemental MediaConvert – File-based transcoding for OTT, broadcast, or archiving, with support for a long list of formats and codecs. Features include multi-channel audio, graphic overlays, closed captioning, and several DRM options.

AWS Elemental MediaLive – Live encoding to deliver video streams in real time to both televisions and multiscreen devices. Allows you to deploy highly reliable live channels in minutes, with full control over encoding parameters. It supports ad insertion, multi-channel audio, graphic overlays, and closed captioning.

AWS Elemental MediaPackage – Video origination and just-in-time packaging. Starting from a single input, produces output for multiple devices representing a long list of current and legacy formats. Supports multiple monetization models, time-shifted live streaming, ad insertion, DRM, and blackout management.

AWS Elemental MediaStore – Media-optimized storage that enables high performance and low latency applications such as live streaming, while taking advantage of the scale and durability of Amazon Simple Storage Service (S3).

AWS Elemental MediaTailor – Monetization service that supports ad serving and server-side ad insertion, a broad range of devices, transcoding, and accurate reporting of server-side and client-side ad insertion.

To quote a former colleague's tweet: setting aside Amazon SWF (which can still find other uses), Amazon Elastic Transcoder is clearly on its way out:

Bundles this big only show up with the energy of AWS re:Invent; you rarely see them the rest of the year...

curl will support Brotli compression

Saw someone on Twitter mention that curl now supports Brotli: "HTTP: implement Brotli content encoding".

Brotli helps most with text-like data (such as HTML):

Unlike most general purpose compression algorithms, Brotli uses a pre-defined 120 kilobyte dictionary, in addition to the dynamically populated ("sliding window") dictionary. The pre-defined dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents. Using a pre-defined dictionary has been shown to increase compression where a file mostly contains commonly-used words.

It's only in master for now; the upcoming release versions should ship with support...
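For a sense of how this would be used from libcurl (a sketch, assuming your libcurl build has Brotli enabled), the existing CURLOPT_ACCEPT_ENCODING option with an empty string asks for every encoding the build supports, which would then include br:

```c
#include <curl/curl.h>

/* Minimal sketch: with a Brotli-enabled libcurl build, requesting
 * compressed transfers the usual way also advertises "br".  An empty
 * string for CURLOPT_ACCEPT_ENCODING means "all built-in encodings",
 * and libcurl decompresses the response transparently. */
int main(void)
{
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "");  /* gzip, deflate, br, ... */

    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return res != CURLE_OK;
}
```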

A branchless UTF-8 decoder

Came across "A Branchless UTF-8 Decoder". Before diving in, it's worth revisiting the earlier post "The truly classic UTF-8..." and the encoding described there:

Because this was taken into account when UTF-8 was designed, decoding maps naturally onto a DFA, which in practice becomes a pile of if-then-else branches. Modern CPUs, however, rely on out-of-order execution and deep pipelines, so unpredictable branches carry a heavy performance penalty; that's why the author wanted to try writing a branchless version.
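As a rough illustration of the idea (this is not the article's code): look up the sequence length from the first byte instead of branching on it, then assemble the code point with straight-line shifts and masks. The sketch assumes the input is already valid UTF-8 and skips error handling entirely.

```c
#include <stdint.h>
#include <stddef.h>

/* Sequence length, indexed by the first byte's top four bits. */
static const uint8_t utf8_len[16] = {
    1, 1, 1, 1, 1, 1, 1, 1,   /* 0xxxxxxx: ASCII                 */
    0, 0, 0, 0,               /* 10xxxxxx: continuation, invalid */
    2, 2,                     /* 110xxxxx                        */
    3,                        /* 1110xxxx                        */
    4                         /* 11110xxx                        */
};

/* Payload mask for the first byte of a 1..4 byte sequence. */
static const uint8_t utf8_mask[5] = { 0x00, 0x7f, 0x1f, 0x0f, 0x07 };

/* Decode one code point and return the sequence length.  The three
 * continuation bytes are read and merged unconditionally, then the
 * surplus bits are shifted back out, so there is no data-dependent
 * branch; the buffer must therefore have at least 4 readable bytes. */
size_t utf8_decode(const uint8_t *s, uint32_t *cp)
{
    size_t len = utf8_len[s[0] >> 4];
    uint32_t c = s[0] & utf8_mask[len];

    c = (c << 6) | (s[1] & 0x3f);
    c = (c << 6) | (s[2] & 0x3f);
    c = (c << 6) | (s[3] & 0x3f);
    *cp = c >> (6 * (4 - len));

    return len;
}
```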

The gains are actually modest; on Clang the difference may well be within measurement noise:

With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.

A later update changes the picture considerably: with the improved DFA variant, the DFA version beats the branchless one on Clang:

Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.

So the author himself concludes it was just an experiment XD:

It’s just a different approach. In practice I’d prefer Björn’s DFA decoder.

Testing U2F Security Key products

Adam Langley's "Testing Security Keys" tests quite a few products that support U2F Security Keys; the author ran the tests on Linux.

tl;dr: on Linux, Yubico's product is the only one without issues; everything else has problems... (the only difference is how many)

No flaws found in Yubico's key:

Easy one first: I can find no flaws in Yubico's U2F Security Key.

The VASCO SecureClick's vendor ID and product ID change on you:

If you're using Linux and you configure udev to grant access to the vendor ID & product ID of the token as it appears normally, nothing will work because the vendor ID and product ID are different when it's active. The Chrome extension will get very confused about this.

The Feitian ePass gets its ASN.1 DER encoding wrong:

ASN.1 DER is designed to be a “distinguished” encoding, i.e. there should be a unique serialisation for a given value and all other representations are invalid. As such, numbers are supposed to be encoded minimally, with no leading zeros (unless necessary to make a number positive). Feitian doesn't get that right with this security key: numbers that start with 9 leading zero bits have an invalid zero byte at the beginning. Presumably, numbers starting with 17 zero bits have two invalid zero bytes at the beginning and so on, but I wasn't able to press the button enough times to get such an example. Thus something like one in 256 signatures produced by this security key are invalid.

The Thetis doesn't follow the spec at all, and has the same ASN.1 DER encoding problem:

With this device, I can't test things like key handle mutability and whether the appID is being checked because of some odd behaviour. The response to the first Check is invalid, according to the spec: it returns status 0x9000 (“NO_ERROR”), when it should be 0x6985 or 0x6a80. After that, it starts rejecting all key handles (even valid ones) with 0x6a80 until it's unplugged and reinserted.

This device has the same non-minimal signature encoding issue as the Feitian ePass. Also, if you click too fast, this security key gets upset and rejects a few requests with status 0x6ffe.

The U2F Zero simply crashes, so it couldn't really be tested XDDD:

A 1KiB ping message crashes this device (i.e. it stops responding to USB messages and needs to be unplugged and reinserted). Testing a corrupted key handle also crashes it and thus I wasn't able to run many tests.

KEY-ID (whose website doesn't even use HTTPS...) / HyperFIDO has the same encoding problem, only worse:

The Key-ID (and HyperFIDO devices, which have the same firmware, I think) have the same non-minimal encoding issue as the Feitian ePass, but also have a second ASN.1 flaw. In ASN.1 DER, if the most-significant bit of a number is set, that number is negative. If it's not supposed to be negative, then a zero pad byte is needed. I think what happened here is that, when testing the most-significant bit, the security key checks whether the first byte is > 0x80, but it should be checking whether it's >= 0x80. The upshot is that sometimes it produces signatures that contain negative numbers and are thus invalid.
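Both flaws boil down to two DER INTEGER rules. A minimal sketch of the checks (my own illustration, not Adam Langley's test code) for the content octets of a positive integer such as an ECDSA signature component:

```c
#include <stdint.h>
#include <stddef.h>

/* DER requires a "distinguished" encoding: no redundant leading 0x00
 * bytes (the Feitian/Thetis bug), and a 0x00 pad byte whenever the top
 * bit of the value would otherwise make it negative (the
 * Key-ID/HyperFIDO bug). */
int der_positive_integer_ok(const uint8_t *val, size_t len)
{
    if (len == 0)
        return 0;

    /* A leading 0x00 is only allowed when it is needed to keep the
     * number positive, i.e. the next byte must be >= 0x80 (not > 0x80). */
    if (val[0] == 0x00 && (len == 1 || val[1] < 0x80))
        return 0;

    /* Without a pad byte, a first byte with the top bit set would
     * encode a negative number, which is invalid here. */
    if (val[0] >= 0x80)
        return 0;

    return 1;
}
```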

So just use your GitHub account to buy a Yubico and be done with it...

Mozilla releases mozjpeg 2.0

othree already wrote about this the day before yesterday: "mozjpeg 2.0". Since this kind of research benefits pretty much everyone, it's worth mentioning again...

The original announcement is "Mozilla Advances JPEG Encoding with mozjpeg 2.0"; the main result:

With today’s release, mozjpeg 2.0 can reduce file sizes for both baseline and progressive JPEGs by 5% on average compared to those produced by libjpeg-turbo, the standard JPEG library upon which mozjpeg is based [1]. Many images will see further reductions.

The article also drops a few keywords:

We’ve added options to specifically tune for PSNR, PSNR-HVS-M, SSIM, and MS-SSIM metrics.

PSNR is the one you hear most often; the others make good entry points for further reading. The See also section of the "Video quality" article also lists plenty of keywords to chase down...
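For a concrete anchor, the standard textbook definition of PSNR (not specific to mozjpeg) for an m×n source image I and its compressed version K, where MAX_I is the maximum possible pixel value (255 for 8-bit images):

$$\mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(I(i,j)-K(i,j)\bigr)^{2},\qquad \mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathit{MAX}_I^{\,2}}{\mathrm{MSE}}\right)$$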

Converting latin1 tables to UTF-8 tables...

The folks at Percona wrote "utf8 data on latin1 tables: converting to utf8 without downtime or double encoding", which shows how to convert latin1 TEXT columns to UTF-8; the trick mentioned in the article is to go through BLOB.

Not sure whether the same approach works on VARCHAR (going through BINARY?), or whether it would run into UNIQUE + prefix index issues. Something to test when I actually run into it...

The truly classic UTF-8...

Saw "UTF-8 – “The most elegant hack”" via the Hacker News digest. Besides the material on Wikipedia, the 2003 mail written by Rob Pike and others is an important primary source as well.

Calling the UTF-8 that Ken Thompson and Rob Pike developed the most elegant hack is no exaggeration at all. Unicode 1.0 was published in October 1991, and encoding formats for it appeared one after another afterwards...

UTF-1, compatible with ASCII 0-127, was proposed in 1992, but its parsing performance was poor.

In July 1992, Dave Prosser proposed FSS-UTF, very similar to the later UTF-8 but lacking the self-synchronizing property (meaning you can easily find a split point starting from the middle of a string).

In August 1992, Ken Thompson reworked FSS-UTF, trading a little bit-level efficiency for the self-synchronizing property. In September 1992, Rob Pike and Ken Thompson implemented UTF-8 on Plan 9, and UTF-8 was formally presented at USENIX in January 1993.

UTF-8's design looks like a hack, yet it has these elegant properties:

Compatibility with existing systems

A string containing only ASCII 0-127 is a valid UTF-8 string.

The key point is that 0 is preserved, so all existing NULL-terminated string handling keeps working: everything from C's strcpy() to a pile of network protocols that have been running for years can continue to be used as-is.

Extremely easy to identify

UTF-8 is very easy to detect; quoting Wikipedia's numbers:

The probability of a random string of bytes which is not pure ASCII being valid UTF-8 is 3.9% for a two-byte sequence, and decreases exponentially for longer sequences.

For a non-ASCII string of any reasonable length (four Chinese characters, 12 bytes?), the accuracy of detecting whether it is UTF-8 should rival the SLA figures of most services...
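As a rough sketch of why detection works (not production code; it only checks the lead/continuation byte structure, not overlong forms or surrogates): random non-UTF-8 bytes quickly fail one of these checks.

```c
#include <stdint.h>
#include <stddef.h>

/* Walk the buffer and verify that every lead byte is followed by the
 * right number of 10xxxxxx continuation bytes. */
int looks_like_utf8(const uint8_t *s, size_t n)
{
    for (size_t i = 0; i < n; ) {
        size_t extra;
        if      (s[i] < 0x80)           extra = 0;  /* 0xxxxxxx */
        else if ((s[i] & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx */
        else if ((s[i] & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx */
        else if ((s[i] & 0xF8) == 0xF0) extra = 3;  /* 11110xxx */
        else return 0;              /* stray continuation or invalid lead byte */

        if (i + extra >= n) return 0;               /* truncated sequence */
        for (size_t j = 1; j <= extra; j++)
            if ((s[i + j] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}
```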

Order-preserving with respect to Unicode

Unicode code point order is preserved by UTF-8, which means even the traditional strcmp() can be used directly:

Sorting a set of UTF-8 encoded strings as strings of unsigned bytes yields the same order as sorting the corresponding Unicode strings lexicographically by codepoint.

Avoids the UTF-16 BOM

The BOM bytes 0xFE and 0xFF never appear in a valid UTF-8 document, so if a file starts with a BOM it is definitely not UTF-8:

The bytes 0xFE and 0xFF do not appear, so a valid UTF-8 stream never matches the UTF-16 byte order mark and thus cannot be confused with it.

The self-synchronizing property

Thanks to how the encoding works, finding the next boundary in a UTF-8 string is easy:

Any byte matching one of the six lead-byte patterns marks a boundary. That is also why the format is so tolerant of corruption.
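As a tiny illustration (my own sketch, assuming the buffer is otherwise valid): re-synchronizing from an arbitrary offset is just "skip bytes until one is not a continuation byte".

```c
#include <stdint.h>
#include <stddef.h>

/* Return the index of the next code-point boundary after position i
 * (or n at the end of the buffer).  Continuation bytes always look
 * like 10xxxxxx, so they are simply skipped. */
size_t next_codepoint_boundary(const uint8_t *s, size_t i, size_t n)
{
    i++;
    while (i < n && (s[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```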

Room for every Unicode character

Again thanks to the encoding, UTF-8 can in theory hold over a million characters (1,112,064 under the additional restrictions of RFC 3629). Until we discover a lot of alien civilizations, that should be plenty. (Unicode 6.2, released in 2012, only has about 110 thousand characters: 110,182.)

Converting between Unicode and UTF-8 is easy

Once more thanks to the encoding, the conversion is almost entirely bit operations. (Note how the "Last code point" values in the table are neatly aligned.)
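A minimal sketch of the code point → UTF-8 direction, just to show it really is nothing but shifts and masks (it assumes the input is a valid Unicode scalar value, i.e. at most 0x10FFFF and not a surrogate):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode one code point into out[], returning the number of bytes. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else {                                /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
}
```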

With so many advantages, it became the super-standard...

It is a standard with almost no downsides, so in the early days many programmers picked it simply because they liked it on sight, which led to a large number of libraries. Then a large number of standards (including the XML standard) flat out declared the ability to handle UTF-8 a requirement.

To sum up...

The UTF-8 encoding looks like a hack from every angle (it seems to carve the Unicode ranges into variable-width sequences rather casually, with no obvious special treatment), yet it solves the problem of "staying compatible with existing systems when 8-bit handling is available" perfectly. And because it solves the problem so cleanly (whatever UTF-8 can't solve, nobody else solves cleanly either), it became the overwhelmingly mainstream encoding...

Google already wrote "Unicode over 60 percent of the web" back in February 2012, and that's after excluding the roughly 20% that is plain ASCII!

Now, near the end of 2013, it's safe to expect UTF-8 to reign for a long, long time...

For more detail, see Wikipedia's "Comparison of Unicode encodings", which compares UTF-8 with the other Unicode formats.

How far MySQL's Unicode support goes

Versions before MySQL 5.5 only support Unicode 3.0 (published in September 1999), but starting with MySQL 5.5, Unicode 5.0 (published in July 2006) is supported, which brings a few changes to watch for in the commonly used utf8 encoding...

See Wikipedia's notes on Unicode versions ("Unicode#Versions") and the MySQL 5.5 documentation ("MySQL :: MySQL 5.5 Reference Manual :: 10.1.10 Unicode Support").

Before MySQL 5.5, the UTF-8 implementation used at most 3 bytes: 1 byte gives 128 combinations (7 bits), 2 bytes give 2,048 (11 bits), and 3 bytes give 65,536 (16 bits), for a total of 67,712 slots, of which Unicode 3.0 only used 49,259.

Unicode 5.0, supported from MySQL 5.5 on, needs 99,089 slots, so the 4-byte form is required, adding another 2,097,152 combinations (21 bits) for a total of 2,164,864 slots.

For compatibility, though, MySQL 5.5's utf8 encoding still sticks to the Unicode 3.0 range; only when you explicitly specify the utf8mb4 encoding do you get the Unicode 5.0 coverage. When using utf8mb4, make sure the client side supports it too, or you won't be able to read the data...
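As a quick illustration of where the 3-byte limit bites (my own example, not from the MySQL docs): any BMP character fits in at most 3 UTF-8 bytes, while characters beyond the BMP, such as most emoji, need 4 bytes and therefore utf8mb4.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: how many UTF-8 bytes a code point needs.
 * MySQL's legacy "utf8" stops at 3 bytes (the BMP); utf8mb4 adds the
 * 4-byte sequences. */
static int utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)    return 1;  /* up to  7 bits */
    if (cp < 0x800)   return 2;  /* up to 11 bits */
    if (cp < 0x10000) return 3;  /* up to 16 bits: the old utf8 limit */
    return 4;                    /* up to 21 bits: needs utf8mb4 */
}

int main(void)
{
    /* U+4E2D (a common CJK character) fits in the 3-byte utf8 range. */
    printf("U+4E2D needs %d bytes\n", utf8_bytes(0x4E2D));
    /* U+1F600 (an emoji, outside the BMP) needs utf8mb4. */
    printf("U+1F600 needs %d bytes\n", utf8_bytes(0x1F600));
    return 0;
}
```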