用在 IoT 裝置上的壓縮演算法 Heatshrink

在「Heatshrink – An ultra-lightweight compression library for embedded systems」這邊看到的壓縮演算法 Heatshrink,主打是在記憶體用量受限的環境下壓縮。

在 2013 年的資料就有壓縮率的比較了:「heatshrink: An Embedded Data Compression Library」。

像是目前常被拿來使用的 ESP32 就只有 320KB 記憶體,gzip 在這種環境下就明顯太肥大了,Heatshrink 在這邊就可以犧牲壓縮率來換取較低的記憶體用量...

另外找了一下資料,發現有 lowzip 這個東西,走 ZIP 格式,記憶體用量也不高,不過軟體本身還掛 alpha:

Current x64 code footprint (for lowzip.c, excluding the test program) is about 3.2kB and RAM footprint is about 1.1kB.

如果之後打算要透過 LPWAN 之類的網路傳東西的話好像有可能會用到,先寫下來...

OpenSSL 1.0.2 與 Let's Encrypt 在這個月月底的相容性問題

看到 OpenSSL 官方居然特地寫了一篇文章,講與 Let's Encrypt 的相容性問題:「Old Let’s Encrypt Root Certificate Expiration and OpenSSL 1.0.2」。

這邊提到的 OpenSSL 1.0.2 很舊了 (在 Ubuntu 16.04 內是 1.0.2g),理論上大多數的機器應該不太會遇到這個問題。

問題出自 Let's Encrypt 舊的 root 憑證 DST Root CA X3 將在這個月月底過期,這在 Let's Encrypt 的「DST Root CA X3 Expiration (September 2021)」這邊也有提到。

The currently recommended certificate chain as presented to Let’s Encrypt ACME clients when new certificates are issued contains an intermediate certificate (ISRG Root X1) that is signed by an old DST Root CA X3 certificate that expires on 2021-09-30.

理想上只要有任何一條 trust chain 成立,就應該會把這個憑證認為是合法的憑證,但 OpenSSL 1.0.2 (以及更早的版本) 不是這樣設計的。

舊版的設計會優先採用 untrusted chain,如果這條 chain 走到的 root 憑證 (這邊就是 DST Root CA X3) 已經過期,就會直接回報憑證過期而驗證失敗:

Unfortunately this does not apply to OpenSSL 1.0.2 which always prefers the untrusted chain and if that chain contains a path that leads to an expired trusted root certificate (DST Root CA X3), it will be selected for the certificate verification and the expiration will be reported.

OpenSSL 官方給了三個 workaround 可以做,另外我還有想到一個惡搞方式,是可以用其他家免費的憑證... 不過也是得測看看在 OpenSSL 1.0.2 下會不會動。
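
自己要測的話,一個簡單的方式是用 Python 的 ssl 模組 (會用到系統上的 OpenSSL) 對目標網域做一次 handshake,看驗證會不會過。下面是個小 sketch,example.com 是示意用的 hostname,實際測試時換成自己要驗的網域:

import socket
import ssl

# 印出 Python 連結到的 OpenSSL 版本,1.0.2 系列就是會中獎的版本
print(ssl.OPENSSL_VERSION)

def check_tls(host, port=443):
    # 用系統的 trust store 驗證 host 的憑證,失敗就把錯誤印出來
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                print(host, 'OK:', tls.version())
    except ssl.SSLError as e:
        print(host, 'verify failed:', e)

check_tls('example.com')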

OpenSSL 3.0 釋出,使用 Apache License 2.0

OpenSSL 3.0 推出了,這是轉換到 Apache License 2.0 後的第一個正式版本:「OpenSSL 3.0 Has Been Released!」。

中間跳過 2.0 的原因在維基百科上也有提到,因為之前被 OpenSSL FIPS module 用掉了:

The major version 2.0.0 was skipped due to its previous use in the OpenSSL FIPS module.

雖然 3.0.0 看起來是大版本,不過主要的功能都在 OpenSSL 1.1.1 先加進去了,沒有什麼特別的理由現在就要升級到 3.0.0...

Elasticsearch 的 Python 套件開始阻擋 OpenSearch 的伺服器了

Hacker News Daily 上看到的:「Official Elasticsearch Python library no longer works with open-source forks (github.com/elastic)」,連結所指向的是 GitHub 上的 pull request,在「Verify connection to Elasticsearch #1623」這邊。

講白了,也就是 Elasticsearch 官方的 Python client 開始阻擋 AWS 主推的 OpenSearch。

另外 AWS 這邊也出手,把本來的 client 都 fork 出來:「Keeping clients of OpenSearch and Elasticsearch compatible with open source」,這場戰爭還有得打...
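
如果手上的專案已經被新版的 elasticsearch-py 擋下來,其中一個方向就是換到 AWS fork 出來的 opensearch-py。下面是個極簡的連線 sketch,localhost:9200 是示意用的 endpoint,實際使用時換成自己的 OpenSearch cluster:

# pip install opensearch-py
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    use_ssl=False,
)

# 印出 cluster 資訊,確認連線沒有被 client 端的檢查擋下來
print(client.info())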

快速產生 SQLite 資料的方式:一分鐘內產生十億筆資料

在「Towards Inserting One Billion Rows in SQLite Under A Minute」這邊看到作者想要在一分鐘內在 MBP 2019 上面寫 1B 筆資料進 SQLite,裡面有些方法還蠻值得玩一下的,這台 MBP 2019 機器的規格是:

The machine I am using is MacBook Pro, 2019 (2.4 GHz Quad Core i5, 8GB, 256GB SSD, Big Sur 11.1)

第一版是 Python 寫的,塞 10M 筆花了 15 分鐘:

In this script, I tried to insert 10M rows, one by one, in a for loop. This version took close to 15 minutes, sparked my curiosity and made me explore further to reduce the time.

加了五個 PRAGMA 的版本變成 100M 筆十分鐘:

The naive for loop version took about 10 minutes to insert 100M rows.

用批次處理則可以降到八分半:

The batched version took about 8.5 minutes to insert 100M rows.
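
batch 的部份大概就是下面這樣的寫法,這是簡化過的 sketch,table schema 與 PRAGMA 的參數只是示意用,不一定跟原文完全相同:

import random
import sqlite3

conn = sqlite3.connect('test.db')

# 用安全性換速度,這是灌測試資料用的設定,正式環境不要這樣搞
conn.executescript('''
    PRAGMA journal_mode = OFF;
    PRAGMA synchronous = OFF;
    PRAGMA cache_size = 1000000;
    PRAGMA temp_store = MEMORY;
''')

conn.execute('CREATE TABLE IF NOT EXISTS user (area TEXT, age INTEGER)')

batch = [(str(random.randint(100000, 999999)), random.randint(18, 80))
         for _ in range(100_000)]

# 一個 transaction 塞一整批,而不是一筆一筆 INSERT
with conn:
    conn.executemany('INSERT INTO user (area, age) VALUES (?, ?)', batch)

conn.close()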

再來是拿經典神器 PyPy 出來用,降到兩分半:

All I had to do was run my existing code, without any change, using PyPy. It worked and the speed bump was phenomenal. The batched version took only 2.5 minutes to insert 100M rows. I got close to 3.5x speed :)

接下來就是跳槽到 Rust 了,中間也有不少 tuning 相關的討論,但直接先跳到最後面好了... 最後 100M 筆只用了約 32 秒:

I created a threaded version, where I had one writer thread that received data from a channel and four other threads which pushed data to the channel. This is the current best version which took about 32.37 seconds.

能用 PyPy 的地方還是可以考慮一下的...

用 Python 的 DuckDB 下 SQL 指令翻 Parquet 的資料

在「Querying Parquet using DuckDB」這邊看到 DuckDB 這個東西,裡面引用的文章是「Querying Parquet with Precision using DuckDB」,可以直接對 Parquet 格式的資料下 SQL 找資料。

先前好像有看到 DuckDB 但沒有太注意,剛剛再次看到,玩了一下覺得還蠻有趣的。DuckDB 支援蠻多程式語言與資料格式,這篇文章則是拿 Python 與 Parquet 來玩...

先把 Parquet 的範例資料抓下來,然後透過 pip 裝 duckdb:

cd /tmp; wget https://github.com/cwida/duckdb-data/releases/download/v1.0/taxi_2019_04.parquet; pip install -U duckdb

然後進到 Python 3 的互動界面:

>>> import duckdb
>>> print(duckdb.query("SELECT COUNT(*) FROM 'taxi_2019_04.parquet' WHERE pickup_at BETWEEN '2019-04-15' AND '2019-04-20'").fetchall())
[(1276565,)]

然後在範例裡面,檔名的部份還可以用 *,看了一下說明,底層是 glob 類的用法:

DuckDB supports the globbing syntax, which allows it to query all three files simultaneously.
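
也就是說把檔名換成 glob pattern 就可以一次查多個檔案,接著上面的互動界面大概會是這樣 (taxi_2019_*.parquet 是示意用的 pattern,得先把對應的檔案都抓下來才跑得動):

>>> import duckdb
>>> print(duckdb.query("SELECT COUNT(*) FROM 'taxi_2019_*.parquet'").fetchall())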

文章裡有提到速度比 Pandas 快很多,不過我覺得這好像不太能這樣比,會拿 Pandas 出來的時候常常是其他用法,但至少看起來速度是個 DuckDB 在意的點。

不過反而馬上想到的是,之後處理 CSV 之類的檔案應該也會試看看 DuckDB...

cURL 與 Travis CI 的事件

前幾天 Daniel Stenberg (cURL 的發明人與現在的維護者) 發表了一篇文章,講從 Travis CI 搬出來,換到 Zuul CI 與 Circle CI 上:「Bye bye Travis CI」,對應的 Hacker News 討論可以在「Bye Bye Travis CI (haxx.se)」這邊翻到。

文章裡提到的主要有兩個點。第一個是 Travis CI 當初有承諾會提供免費服務給 open source project,但後來沒有兌現。

另外是今年 Travis CI 的商業收費機制改了,也嚴格增加了對 open source project 的限制,包括了專案不能收到任何商業公司或是任何組織的贊助:

Project must not be sponsored by a commercial company or organization (monetary or with employees paid to work on the project)

另外 Daniel Stenberg 在文章裡也表明目前不打算付錢,要找市場上便宜可用的方案 (而目前看起來還有,至少 Zuul CI 與 Circle CI 都在選項內),所以就從 Travis CI 搬離了:

Lots of people have commented and think I’m “whining” about Travis CI charging for something that is useful and that I should rather just pay up. I could probably have gone with that but I dislike their broken promise and that they don’t consider us Open source anymore and I feel I have a responsibility to use the funds we get from gracious donors as wisely and economically as possible, and that includes using no-cost or cheap services rather than services charging thousands of dollars per year.

If there really were no other available and viable options, then paying could’ve been an alternative. Now, moving on to something else was the right choice for us.

然後 Travis CI 過了兩天丟出「Open Source Terms at Travis CI – An Update and Clarification」,雖然沒有表明是在講 cURL,但放在一起看其實也都大概知道發生什麼事情。

看了一下自己的小專案 (更新頻率不高,test case 也不多),丟 Circle CI 還算是夠用,另外自己也有弄個 GitLab,需要的時候也可以在上面跑 CI。

這邊另外看到 cURL 這種大型專案,因為 test case 數量很多,丟到不同的 CI vendor 上跑,看起來是個還不錯的架構...

Django 3.2 LTS

Django 3.2 LTS 出了:「Django 3.2 released」,對應的 release note 可以在「Django 3.2 release notes」這邊看到。

這個版本比較特別,一般版本提供大約 15 個月的支援,LTS 版本則提供 36 個月的支援,不過目前用起來的感覺還是鼓勵大家平常就要安排升級計畫,不然 3.2 升級到 4.2 鐵定是個超痛苦的過程。

昨天把 Django 3.1 的專案升級上 3.2 後,跑 test case 就遇到「Customizing type of auto-created primary keys」這邊描述的問題,目前先用 DEFAULT_AUTO_FIELD = 'django.db.models.AutoField' 解,其他的倒是沒什麼問題...
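
補充一下設定的方式,可以在 settings.py 裡全專案指定,也可以在個別 app 的 AppConfig 上指定,下面是個簡單的示意 (myapp 是假設的 app 名稱):

# settings.py:整個專案統一指定,維持跟 3.1 以前一樣的 AutoField
DEFAULT_AUTO_FIELD = 'django.db.models.AutoField'

# 或是在個別 app 的 apps.py 裡面指定
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    name = 'myapp'
    default_auto_field = 'django.db.models.BigAutoField'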

幾個我覺得有趣的,像是 JSONL 格式:

The new JSONL serializer allows using the JSON Lines format with dumpdata and loaddata. This can be useful for populating large databases because data is loaded line by line into memory, rather than being loaded all at once.
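
dumpdata 與 loaddata 是 management command,不過同一套 serializer 也可以直接在程式裡面用,下面拿 django.contrib.auth 的 User model 當示意:

from django.core import serializers
from django.contrib.auth.models import User

# 序列化成 JSON Lines,一行一筆 record
data = serializers.serialize('jsonl', User.objects.all())

# 再把 JSONL 內容讀回來,一筆一筆處理
for obj in serializers.deserialize('jsonl', data):
    print(obj.object)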

然後不支援 PostgreSQL 9.5 與 MySQL 5.6 了:

Upstream support for PostgreSQL 9.5 ends in February 2021. Django 3.2 supports PostgreSQL 9.6 and higher.

The end of upstream support for MySQL 5.6 is April 2021. Django 3.2 supports MySQL 5.7 and higher.

很久前寫 Django 1.x,然後就荒廢很久,最近是從 3.0 開始寫,有些東西熟一點以後覺得怪怪的,之後要找時間測一些東西,修正對 Django 的用法。

Mapbox GL JS 的授權改變,以及 MapLibre GL 的誕生

看到「MapLibre GL is a free and open-source fork of mapbox-gl-JS (github.com/maplibre)」這篇,翻了一下資料發現年初時 Mapbox GL JS 的軟體授權從 v2.0.0 開始不再是 open source license (本來是 BSD license),而社群也馬上 fork 了最後一個 open source 版本並且投入開發,變成 MapLibre GL。

MapTiler 在年初的時候有提到這件事情:「MapLibre: Mapbox GL open-source fork」。

The community reacted swiftly: forks of the latest open-source version were made almost immediately by multiple parties. In another positive development, the community came together the next day and agreed to make this a joint effort, rather than splitting energies. A video call was organized and the MapLibre coalition was formed. It includes people working for MapTiler, Elastic, StadiaMaps, Microsoft, Ceres Imaging, WhereGroup, Jawg, Stamen Design, etc.

MapLibre GL 目前與本來的 v1.13.0 相容,可以直接抽換過去 (Mapbox 後來在二月的時候有出一個 v1.13.1,不過那是在 v2.0.0 改 license 之後的事情了):

  "dependencies": {
-    "mapbox-gl": "^1.13.0"
+    "maplibre-gl": ">=1.14.0"
  }

記錄一下,以後要在網站上用的話,得注意到 Mapbox GL JS 在沒有註冊的情況下不能使用,而且 SDK 會強制蒐集資料:

Mapbox gl-js version 2.0 or higher (“Mapbox Web SDK”) must be used according to the Mapbox Terms of Service. This license allows developers with a current active Mapbox account to use and modify the Mapbox Web SDK. Developers may modify the Mapbox Web SDK code so long as the modifications do not change or interfere with marked portions of the code related to billing, accounting, and anonymized data collection. The Mapbox Web SDK only sends anonymized usage data, which Mapbox uses for fixing bugs and errors, accounting, and generating aggregated anonymized statistics. This license terminates automatically if a user no longer has an active Mapbox account.

不過如果是抓 OpenStreetMap 資料的話,Leaflet 應該還是目前的首選...

sscanf() 與 strlen() 的故事繼續發展

昨天在「GTA 的啟動讀取效能問題」這邊提到了 sscanf() 與 strlen() 的問題,剛剛在 Hacker News Daily 上又看到一篇「It Can Happen to You」,在講他自己的專案也中獎。

他提到了一個解法,用 strtof() 取代 sscanf() 讀數字,結果大幅降低了 parsing 的時間:

Replacing the sscanf call with strtof improved startup by nearly a factor of 10: from 1.8 seconds to 199 milliseconds.

文章的最後面提到了不少目前正在進行中的討論與 patch。

首先是 FreeBSD 上的 patch 已經在測試:「address a performance problem w/ partial sscanf on long strings...」,裡面可以看到有很小心的在研究會不會造成 performance regression。

然後是 glibc 這邊,在 2014 年就有被開了一張票提出來:「Bug 17577 - sscanf extremely slow on large strings」,不過下面只是多了幾個 comment,目前沒有任何進度。

然後是 cppreference.com 上的「std::scanf, std::fscanf, std::sscanf」頁面則是加註了複雜度的問題:

Complexity

Not guaranteed. Notably, some implementations of sscanf are O(N), where N = std::strlen(buffer) [1]. For performant string parsing, see std::from_chars.

感覺接下來應該還會有更多人提出自己的災情,或是有人發現某個跑很慢的專案也是因為這個原因...