cURL 的 TLS fingerprint

看到「curl's TLS fingerprint」這篇,cURL 的作者 Daniel Stenberg 提到 TLS fingerprint。

先前在「修正 Curl 的 TLS handshake,避開 bot 偵測機制」與「curl 的 TLS fingerprint 偽裝專案 curl-impersonate 支援 Chrome 了」這邊有提到 curl-impersonate 這個專案,試著在 cURL 裡模擬市面上常見的瀏覽器的 TLS fingerprint。

在 Daniel Stenberg 的文章裡面也有提到這件事情,另外也提到了對 curl-impersonate 的態度:

I cannot say right now if any of the changes done for curl-impersonate will get merged into the upstream curl project, but it will also depend on what users want and how the use of TLS fingerprinting spread or changes going forward.

看起來短期內他是沒打算整,跟當初 curl-impersonate 的預期差不多...

Tor 的 Rust 計畫 Arti 推進到 1.0.0 版

在「Arti 1.0.0 is released: Our Rust Tor implementation is ready for production use.」這邊看到 TorRust 計畫進入了 1.0.0 版。

不過每次編 Rust 的東西都會發現 Rust 版本不夠新,這次也不例外,就不知道是 Rust community 的特性還是真的太少用 Rust...

    Updating crates.io index
  Downloaded arti v1.0.0
error: failed to parse manifest at `/home/gslin/.cargo/registry/src/github.com-1ecc6299db9ec823/arti-1.0.0/Cargo.toml`

Caused by:
  feature `edition2021` is required

  this Cargo does not support nightly features, but if you
  switch to nightly channel you can add
  `cargo-features = ["edition2021"]` to enable this feature

rustup update 更新後就能編了,然後跑起來看起來沒什麼問題:

$ arti proxy -p 9150
2022-09-03T17:13:30.234032Z  INFO arti: Starting Arti 1.0.0 in SOCKS proxy mode on port 9150...
2022-09-03T17:13:30.238606Z  INFO tor_circmgr: We now own the lock on our state files.
2022-09-03T17:13:30.238652Z  INFO tor_dirmgr: Didn't get usable directory from cache.
2022-09-03T17:13:30.238674Z  INFO arti::socks: Listening on 127.0.0.1:9150.
2022-09-03T17:13:30.238686Z  INFO arti::socks: Listening on [::1]:9150.
2022-09-03T17:13:30.238713Z  INFO tor_dirmgr::bootstrap: 1: Looking for a consensus.
2022-09-03T17:13:33.833304Z  INFO tor_dirmgr::bootstrap: 1: Downloading certificates for consensus (we are missing 9/9).
2022-09-03T17:13:34.335754Z  INFO tor_dirmgr::bootstrap: 1: Downloading microdescriptors (we are missing 6629).
2022-09-03T17:13:41.041683Z  INFO tor_dirmgr::state: The current consensus is fresh until 2022-09-03 17:00:00.0 +00:00:00, and valid until 2022-09-03 19:00:00.0 +00:00:00. I've picked 2022-09-03 18:35:38.290798754 +00:00:00 as the earliest time to replace it.
2022-09-03T17:13:41.061978Z  INFO tor_dirmgr: Marked consensus usable.
2022-09-03T17:13:41.065536Z  INFO tor_dirmgr: Directory is complete.
2022-09-03T17:13:41.065557Z  INFO tor_dirmgr: We have enough information to build circuits.
2022-09-03T17:13:41.065564Z  INFO arti: Sufficiently bootstrapped; system SOCKS now functional.

curl 測試也的確是 Tor 的 exit node 了:

$ curl -i --socks5 127.0.0.1:9150 https://httpbin.org/ip
HTTP/2 200 
date: Sat, 03 Sep 2022 17:21:20 GMT
content-type: application/json
content-length: 32
server: gunicorn/19.9.0
access-control-allow-origin: *
access-control-allow-credentials: true

{
  "origin": "85.93.218.204"
}

$ host 85.93.218.204
204.218.93.85.in-addr.arpa domain name pointer tor.localhost.lu.

看起來 client 的功能能用了...

WebKit 專案將從 Subversion 搬到 GitHub 上

Hacker News 首頁上看到 WebKit 專案宣佈從本來的 Subversion 搬到 GitHub 上:「WebKit on GitHub!」,新的專案位置在 WebKit/WebKit 這邊。對應的討論在「WebKit Migrates from Subversion to GitHub (webkit.org)」這邊。

但 issue tracking system 的部份目前看起來還是繼續用 WebKit Bugzilla,沒有開 GitHub 的 issue 功能,但至少是從 Subversion 換成 Git 了。

在 Hacker News 上的討論意外看到已經是歷史的 SVK (試著在 Subversion 上面堆一些功能),而且還聊到了一些 tricky 的技巧:

Funny story: my first task when I joined the original iPhone team was to merge our forked WebKit with master. It was a sort of hazing ritual slash "when else would we do it but when someone new joins?". Anyways, we used a tool called SVK[1] in order to get very primitive "git-like" abilities. It was basically a bunch of scripts that used SVN under the hood. For example, in order to get the "local everything"-style behaviors of git, the very first thing it did was checkout every single version of the repository in question. For WebKit, this meant that the first day was spent leaving the computer alone and letting it download for hours. I made the mistake of having a space somewhere in the path of the target folder, which broke something or other, so I ended up having to do it all over again.

Anyways, I distinctly remember one of the instructions for merging WebKit in our internal wiki being something like "now type `svk merge`, but hit ctrl-c immediately after! You don't want to use the built-in merge, it'll break everything, but this is the only way to get a magic number that you can find stored in [some file] after the merge has started. If it's not there, try `svk merge` again and let it go a little longer than last time." A few hires later (I think possibly a year after) someone set up a git mirror internally to avoid having to do this craziness, which if I remember correctly, was treated with some skepticism. This was 2007, so why would we try some new-fangled git thing when we had svk?

1. https://wiki.c2.com/?SvkVersionControl

我記得那個時間點 VCS 的選擇的確是個有趣的決策過程... 除了 Git 以外還有 Mercurial,另外還有幾個當時就已經算小眾的 open source solution。

而到了 2010 後就比較明朗了,現在幾乎是 Git 一統天下了,Mercurial 目前最大的使用者應該是 Meta (Facebook)?

月份傳回值 0 表示一月的考古

Hacker News 上看到「History of Zero-Based Months? (jefftk.com)」這篇,在考古為什麼常常看到 function 在傳回「月份」時是以 0 表示一月。從這篇提到的「Why is day of the month 1-indexed but the month is 0-indexed in C? (twitter.com/hillelogram)」則是 2020 年的討論。

在討論裡面有提到 hillelogram 的 tweet,裡面有個看起來算合理的考古過程...

試著用 Thread Reader 產生單頁 (讀起來會比較好讀),但不知道為什麼一直失敗,結果往 Internet Archive 翻資料,倒是有 2020 年當初生成出來的版本

另外還是列出原來第一則 tweet:

作者在研究這個題目的時候,馬上可以想到的是 C 語言裡面的月份就是 0-indexed,而其他程式語言都很有可能會是因為 C 語言的關係一路把這個特性繼承下去:

直接跳到最後面作者的猜測,他覺得可能是為了讓後續使用起來更方便的關係。

其他的欄位大多都是透過類似 sprintf("%d") 的方式直接輸出數字,所以用 1-indexed 讓人直接讀,而月份則會透過 array 來轉字串,所以用 0-indexes 讓程式轉:

So that's my best guess: the programmers were working with constrained resources and could optimize `asctime` tricky pointer arithmetic on the month and day-of-week, so made them 0-indexed. Day-of-month is just for displaying to the user, so is 1-indexed.

沒辦法確定,但就是一種猜測,看起來還蠻... 合理... 的?

Brian Kernighan 幫 AWK 加上 Unicode 支援

看到這個 commit,裡面是 Brian KernighanAWK 加上 Unicode 支援的信件。

現在在各 OS 下都有包不一樣的 AWK 實做,像是 Linux 下常見的 GNU awk,所以這個 patch 原版的 AWK 其實還好,但更多的是一種景仰的感覺...

Brian Kernighan 就是 AWK 的作者之一 (AWK 裡的 K 就是他),另外一個更為人熟知的就是經典的 The C Programming Language 這本書,或是因此而知名的簡寫,K&R 裡面的 K。

老人家八十歲還會送 patch 給現在的 maintainer 加功能...

跑在本機的 GitHub Copilot 替代品

Hacker News 上看到「FauxPilot – an attempt to build a locally hosted version of GitHub Copilot (github.com/moyix)」這個本機上跑 GitHub Copilot 協定的專案。專案的 GitHub 在「FauxPilot - an open-source GitHub Copilot server」這邊。

裡面用的是 Salesforce 放出來的 CodeGen,不過 Salesforce 提供了 350M、2B、6B 與 16B 的 model,但在 FauxPilot 這邊目前只看到 350M、6B 與 16B 的 model 可以用,少了 2B 這組,然後需要的 VRAM 就有點尷尬了:

[1] codegen-350M-mono (2GB total VRAM required; Python-only)
[2] codegen-350M-multi (2GB total VRAM required; multi-language)
[3] codegen-6B-mono (13GB total VRAM required; Python-only)
[4] codegen-6B-multi (13GB total VRAM required; multi-language)
[5] codegen-16B-mono (32GB total VRAM required; Python-only)
[6] codegen-16B-multi (32GB total VRAM required; multi-language)

13GB 剛好超過 3080 Ti 的 12GB,所以不是 3090 或 3090 Ti 的使用者就只能跑 350M 這個版本?看 Hacker News 上的討論似乎是有打算要弄 2B 的版本啦...

然後我自己雖然是 11GB 的 1080 Ti,想跑個 350M 的版本測試看看,但看起來相關的 Nvidia driver 沒裝好造成他識別不到,加上我是用 neovim,看了一下目前 ~/.config/github-copilot/hosts.json 的內容,程式碼應該是寫死到 GitHub API 上使用:

{"github.com":{"user":"gslin","oauth_token":"x"}}

先暫時放著好了,晚點等 2B 版本出現後再回來看看有沒有比較完整的指示...

原來有專有名詞:TOCTOU (Time-of-check to time-of-use)

看「The trouble with symbolic links」這篇的時候看到的專有名詞:「TOCTOU (Time-of-check to time-of-use)」,直翻是「先檢查再使用」,算是一個常見的 security (hole) pattern,因為檢查完後有可能被其他人改變,接著使用的時候就有可能產生安全漏洞。

在資料庫這類環境下,有 isolation (ACID 裡的 I) 可以確保不會發生這類問題 (需要 REPEATABLE-READ 或是更高的 isolation level)。

但在檔案系統裡面看起來不太順利,2004 年的時候研究出來沒有 portable 的方式可以確保避免 TOCTOU 的問題發生:

In the context of file system TOCTOU race conditions, the fundamental challenge is ensuring that the file system cannot be changed between two system calls. In 2004, an impossibility result was published, showing that there was no portable, deterministic technique for avoiding TOCTOU race conditions.

其中一種 mitigation 是針對 fd 監控:

Since this impossibility result, libraries for tracking file descriptors and ensuring correctness have been proposed by researchers.

然後另外一種方式 (比較治本) 是檔案系統的 API 支援 transaction,但看起來不被主流接受?

An alternative solution proposed in the research community is for UNIX systems to adopt transactions in the file system or the OS kernel. Transactions provide a concurrency control abstraction for the OS, and can be used to prevent TOCTOU races. While no production UNIX kernel has yet adopted transactions, proof-of-concept research prototypes have been developed for Linux, including the Valor file system and the TxOS kernel. Microsoft Windows has added transactions to its NTFS file system, but Microsoft discourages their use, and has indicated that they may be removed in a future version of Windows.

目前看起來的問題是沒有一個讓 Linux community 能接受的 API 設計?

玩玩文字轉圖片的 min(DALL·E)

幾個禮拜前看到「Show HN: I stripped DALL·E Mini to its bare essentials and converted it to Torch (github.com/kuprel)」這個東西,有訓練好的 model 可以直接玩文字轉圖片,GitHub 專案在「min(DALL·E) is a fast, minimal port of DALL·E Mini to PyTorch」這邊可以取得。

因為這是包裝過的版本,裝起來 & 跑起來都很簡單,但沒想到桌機的 1080 Ti 還是跑不動,只能用 CPU 硬扛了,速度上當然是比官網上面列出來用 GPU 的那些慢很多,但至少能跑起來玩看看。

首先是拿官方的句子來玩看看,第一次跑會需要下載 model (會放到我們指定的 pretrained 目錄下):

#!/usr/bin/env python3

from min_dalle import MinDalle
import torch

model = MinDalle(
    models_root='./pretrained',
    dtype=torch.float32,
    device='cpu',
    is_mega=True,
    is_reusable=False,
)

images = model.generate_image(
    text='Nuclear explosion broccoli',
    seed=-1,
    grid_size=2,
    is_seamless=False,
    temperature=1,
    top_k=256,
    supercondition_factor=32,
    is_verbose=False,
)

images = images.save('test.png')

我自己在下載過後,跑每個生成大概都需要十分鐘左右 (參數就像上面列的,CPU 是 AMD 的 5800X,定頻跑在 4.5GHz),出來的結果是這樣:

接著是一些比較普通的描述,這是 sleeping fat cats

然後來測試看看一些比較偏門的詞,像是 Lolicon,這個就差蠻多了:

但感覺有蠻多應用可以掛上去,這樣有點想買張 3090 了...

Decompile to C 的工具

昨天在 Hacker News 上看到「Decompiler Explorer (dogbolt.org)」這篇,裡面列出了很多 Decompile to C 的工具 (就不用直接硬看 assembly),包括了 open source 與商用軟體:

網站本身則是提供界面可以交叉比較,不過各家的結果看起來還是有侷限...

用 GPT-3 解讀程式碼

Hacker News 上看到的方法,Simon Willison 試著把程式碼餵進 GPT-3,然後問 GPT-3 程式碼的意思,看起來答的還不錯:「Using GPT-3 to explain how code works」,對應的討論 (包括 Simon Willison 的回應) 則可以在「Using GPT-3 to explain how code works (simonwillison.net)」這邊看到。

第一個範例裡面可以解讀 regular expression,雖然裡面對 (?xm) 的解讀是錯的,但我會說已經很強了...

第二個範例在解釋 Shadow DOM,看起來也解釋的很不錯...

第三個範例回來原來產生程式碼的例子,拿來生 SQL 指令。

後面的 bonus 題目居然是拿來解釋數學公式,他直接丟 TeX 文字進去要 GPT-3 解釋柯西不等式 (Cauchy–Schwarz inequality)。這樣我想到以前高微作業常常會有一堆證明題,好像可以丟進去要 GPT-3 給證明耶...