Camerfirma 的情況

Solidot 上看到「Google 将屏蔽西班牙 CA Camerfirma」這篇,裡面是引用了 ZDNet 的「Google bans another misbehaving CA from Chrome」這則報導,看了一下相關的資料,好像還蠻精彩的...

在「Summary of Camerfirma's Compliance Issues」這邊有不少人的討論,另外 GoogleRyan Sleevi 宣佈將不會信任 Camerfirma CA 的公告則可以在這邊看到。

Mozilla 有在他們自家的 wiki 上面整理出 Camerfirma 發生過的各種問題:「CA:Camerfirma Issues」。

開頭的「Issue B: Delegation of validation to StartCom following Mozilla's distrust of StartCom (March 2017)」就看到讓人頗意外的關係,原來這家公司跟當年的 WoSign & StartCom 有關:

In March 2017, following the distrust of WoSign/StartCom, Camerfirma entered into a business relationship with StartCom. As part of that relationship, StartCom's CEO Iñigo underwent training as an RA for Camerfirma. As described, Iñigo performed organizational and domain validation for customers of StartCom, and then instructed Camerfirma to issue the certificates.

花了大概二十分鐘翻完,這戰績有點輝煌... Mozilla 到目前還是沒做決定,反倒是 Google 先砍了第一刀?

PostgreSQL 的 scale 建議

Hacker News Daily 上看到「Postgres scaling advice for 2021」這篇,講 PostgreSQL 要怎麼 scale,在 Hacker News 上也有對應的討論可以看:「Postgres scaling advice (cybertec-postgresql.com)」。

文章前面先提到分散式系統的複雜度會導致 RDBMS 上的一些假設失效,所以如果可以用單台機器暴力解,就儘量用單台機器來解 (scale up 的情境),裡面就提到了一些「暴力可以解決很多問題」的說明,差不多就是前幾天提到的「Let's Encrypt 升級資料庫伺服器 (AMD YES?)」。

後面提到如果真的要放進分散式的 RDBMS (scale out 的情境),怎麼設計資料結構會比較好。

這邊剛好也可以提一下,量夠大的時候要把 OLTPOLAP 的應用分開,現在有很多 OLAP 資料庫可以選擇,同步的工具也很成熟了,通常效能會比在 OLTP 上面硬跑來的好。

最後提一下,文章裡面對於 transaction per second 可以拉很高,有些假設沒有明寫出來。這需要盡可能把 transaction 拆小,避免常常有 giant transaction 卡住整個資料庫,這點對於一般的系統會需要做不少改寫...

不過最後比較疑惑的是,這種文章怎麼會上 Hacker News 的啊...

Backblaze 在 2020 年對機械硬碟的回顧

前幾天 Backblaze 放了 2020 年的回顧資料出來:「Backblaze Hard Drive Stats for 2020」。

整體的 AFR (Annualized Failure Rate) 在 0.93% 左右,而如果照品牌拆開,HGST 的數字依然是最漂亮的 (雖然他現在是 WD 的品牌),大約在 0.36% 左右 (111/(1083774+4663049+372000+820272+275779+3968475)),Toshiba 次之,大約低了平均值一些落在 0.89%,而 Seagate 光是看就就知道會超過 1%...

官方有提到,低於 250,000 drive days 以下的數據僅供參考,因為資料量太少,在統計上無法提供結論:

For drives which have less than 250,000 drive days, any conclusions about drive failure rates are not justified. There is not enough data over the year-long period to reach any conclusions. We present the models with less than 250,000 drive days for completeness only.

然後 WD 本家的硬碟回到戰線了,記得之前基本上算是被唾棄 XDDD

另外一張表則是講到這三年的情況,可以看出來 2020 年的 AFR 數字降了不少,裡面也解釋了為什麼 (看起來就是活下來的穩下來了...):

The answer: It was a group effort. To start, the older drives: 4TB, 6TB, 8TB, and 10TB drives as a group were significantly better in 2020, decreasing from a 1.35% AFR in 2019 to a 0.96% AFR in 2020. At the other end of the size spectrum, we added over 30,000 larger drives: 14TB, 16TB, and 18TB, which as a group recorded an AFR of 0.89% for 2020. Finally, the 12TB drives as a group had a 2020 AFR of 0.98%. In other words, whether a drive was old or new, or big or small, they performed well in our environment in 2020.

用 GPT-3 直接產生對應的 SQL query

Hacker News Daily 上看到的東西,直接給 GPT-3 條件,叫 GPT-3 給出對應的 SQL query,這就有點誇張了...:「Automating My Job with GPT-3」。

這樣丟進去:

Instruction: Given an input question, respond with syntactically correct PostgreSQL. Be creative but the SQL must be correct.

Input: how many users signed up in the past month?

然後這樣出來:

GPT-3 Response: SELECT COUNT(*) FROM users
WHERE signup_time > now() - interval '1 month'

把欄位名稱替換掉就可以用了... 而且接下來就更誇張了,直接要求 GPT-3 也要照著設定的 table schema 給答案:

Instruction: Given an input question, respond with syntactically correct PostgreSQL. Be creative but the SQL must be correct. Only use tables called "users" and "charges". The "users" table has columns: id (integer), signup_dt (timestamp), email (character varying), and plan_type (character varying). The "charges" table has columns: amount (bigint), user_id (integer), and charge_dt (timestamp).

Input: how much revenue did we have in the past 7 days?

然後輸出了:

GPT-3 Response: SELECT SUM(amount) FROM charges WHERE charge_dt > now() - interval '7 days'

接下來是在同樣 instruction 下,跨表格的問題:

Input: how much revenue have we had from users that signed up in the last 6 months?

這時候 INNER JOIN 就跑出來了:

.8 Temperature GPT-3 Response: SELECT SUM(charges.amount) FROM users INNER JOIN charges ON users.id = charges.user_id WHERE signup_dt >= DATE_SUB(now(), INTERVAL '6 months')

後面的問題也很精彩,看起來之後可以接上 BI dashboard,直接丟句子進去,然後拉各種資料出來視覺化?

Firefox 的 Cache Partition 也生效了

Google Chrome 實做了 Cache Partition (「Google Chrome 的 Cache Partition 生效」),Firefox 也實做的版本也進 standard 版本了:「Firefox 85 Cracks Down on Supercookies」。

In Firefox 85, we’re introducing a fundamental change in the browser’s network architecture to make all of our users safer: we now partition network connections and caches by the website being visited.

這要防禦的手法可以參考先前在「Google Chrome 要藉由拆開 HTTP Cache 提昇隱私」這邊提到的方法,主要就是共用 cache 時可以利用抓檔案的時間來判斷 (side channel information)。

這樣主流的瀏覽器都支援 Cache Partition 了...

Eventbrite 的 MySQL 升級計畫

在 2021 年看到 EventbiteMySQL 升級計畫:「MySQL High Availability at Eventbrite」。

看起來是 2019 年年初的時候 MySQL 5.1 出問題,後續決定安排升級,在 2019 年年中把系統升級到 MySQL 5.7 (Percona Server 版本):

Our first major hurdle was to get current with our version of MySQL. In July, 2019 we completed the MySQL 5.1 to MySQL 5.7 (v5.7.19-17-log Percona Server to be precise) upgrade across all MySQL instances.

然後看起來是直接在 EC2 上跑,不過這邊提到的空間問題就不太確定了,是真的把 EBS 的空間上限用完嗎?比較常使用的 gp2gp3 上限都是 16TB,不確定是不是真的用到接近爆掉了:

Not only was support for MySQL 5.1 at End-of-Life (more than 5 years ago) but our MySQL 5.1 instances on EC2/AWS had limited storage and we were scheduled to run out of space at the end of July. Our backs were up against the wall and we had to deliver!

另外在升級到 5.7 的時候,順便把本來是 INT 的 primary key 都換成 BIGINT

As part of the cut-over to MySQL 5.7, we also took the opportunity to bake in a number of improvements. We converted all primary key columns from INT to BIGINT to prevent hitting MAX value.

然後系統因為舊版的 Django 沒辦法配合 MySQL 5.7,得升級到 Django 1.6 (要注意 Django 1 系列的最新版是 1.11,看起來光是升級到 1.6 勉強會動就升不上去了?):

In parallel with the MySQL 5.7 upgrade we also Upgraded Django to 1.6 due a behavioral change in MySQL 5.7 related to how transactions/commits were handled for SELECT statements. This behavior change was resulting in errors with older version of Python/Django running on MySQL 5.7

然後採用了 GitHub 家研發的 gh-ost 當作改變 schema 的工具:

In December 2019, the Eventbrite DBRE successfully implemented a table ALTER via gh-ost on one of our larger MySQL tables.

看起來主要的原因是有遇到 pt-online-schema-change 的限制 (在「GitHub 發展出來的 ALTER TABLE 方式」這邊有提到):

Eventbrite had traditionally used pt-online-schema-change (pt-osc) to ALTER MySQL tables in production. pt-osc uses MySQL triggers to move data from the original to the “duplicate” table which is a very expensive operation and can cause replication lag. Matter of fact, it had directly resulted in several outages in H1 of 2019 due to replication lag or breakage.

另外一個引入的技術是 Orchestrator,看起來是先跟 HAProxy 搭配,不過他們打算要再換到 ProxySQL

Next on the list was implementing improvements to MySQL high availability and automatic failover using Orchestrator. In February of 2020 we implemented a new HAProxy layer in front of all DB clusters and we released Orchestrator to production!

Orchestrator can successfully detect the primary failure and promote a new primary. The goal was to implement Orchestrator with HAProxy first and then eventually move to Orchestrator with ProxySQL.

然後最後題到了 Square 研發的 Shift,把 gh-ost 包裝起來變成有個 web UI 可以操作:

2021 還可以看到這類文章還蠻有趣的...

更新 Sudo (CVE-2021-3156)

Sudo 這次的安全性漏洞頗痛的:「CVE-2021-3156: Heap-Based Buffer Overflow in Sudo (Baron Samedit)」。

依照 Sudo Security Alerts 這邊的說明,這次的漏洞只要是本機有執行權限的人都有機會打穿,不需要有 sudo 帳號權限:

A potential security issue exists in sudo that could be used by a local user to gain root privileges even when not listed in the sudoers file. Affected sudo versions are 1.8.2 through 1.8.31p2 and 1.9.0 through 1.9.5p1. Sudo 1.9.5p2 and above are not affected.

不限於本來就有 sudo 帳號 (但可能可執行指令受限) 就比較麻煩了,這代表從 web 之類的管道打進去以後 (可能是 www-data 身份),就可以用這個洞取得 root 權限,另外一條路是透過資料庫打進去後 (像是 mysql 身份) 取得。

該來把一堆機器更新了...

GET 與 POST 的差異

看到這篇在講 HTTP (& HTTPS) 裡面 GET 與 POST 的差異,剛好把一些標準的定義拿出來翻一翻,算是複習基本概念:「Get safe」。

第一個基本概念主要是 idempotence (& idempotent),重複被呼叫不會造成狀態的再次改變:

Idempotence ([...]) is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application.

數學定義是這樣跑:

An element x of a magma (M, •) is said to be idempotent if:

x • x = x.

If all elements are idempotent with respect to •, then • is called idempotent. The formula ∀x, x • x = x is called the idempotency law for •.

這點在 HTTP 標準 (RFC 7231) 裡面的定義也類似:

A request method is considered "idempotent" if the intended effect on the server of multiple identical requests with that method is the same as the effect for a single such request. Of the request methods defined by this specification, PUT, DELETE, and safe request methods are idempotent.

第二個基本概念是 Safe method (也是在同樣的 RFC 裡被提到),主要的思想是 read-only,這也是文章作者的標題要講的事情:

Request methods are considered "safe" if their defined semantics are essentially read-only; i.e., the client does not request, and does not expect, any state change on the origin server as a result of applying a safe method to a target resource. Likewise, reasonable use of a safe method is not expected to cause any harm, loss of property, or unusual burden on the origin server.

然後標準的 HTTP method 是有定義的:

   +---------+------+------------+---------------+
   | Method  | Safe | Idempotent | Reference     |
   +---------+------+------------+---------------+
   | CONNECT | no   | no         | Section 4.3.6 |
   | DELETE  | no   | yes        | Section 4.3.5 |
   | GET     | yes  | yes        | Section 4.3.1 |
   | HEAD    | yes  | yes        | Section 4.3.2 |
   | OPTIONS | yes  | yes        | Section 4.3.7 |
   | POST    | no   | no         | Section 4.3.3 |
   | PUT     | no   | yes        | Section 4.3.4 |
   | TRACE   | yes  | yes        | Section 4.3.8 |
   +---------+------+------------+---------------+

不過文章裡面提到的第一個例子並沒有很好,POST 不保證 safe 沒錯,但不代表 safe operation 就不能用 POST。

這邊用 URI resource 的概念 (以及 SEO?) 或是用 Post/Redirect/Get 的概念來說明會比較好:

<form method="get" action="/search">
<input type="search" name="term">

不過文章後續提到的問題的確就是我自己都會犯錯的問題:

“Log out” links that should be forms with a “log out” button—you can always style it to look like a link if you want.

“Unsubscribe” links in emails that immediately trigger the action of unsubscribing instead of going to a form where the POST method does the unsubscribing. I realise that this turns unsubscribing into a two-step process, which is a bit annoying from a usability point of view, but a destructive action should never be baked into a GET request.

這兩個動作都會造成 server 端的狀態改變,不應該用 GET,而我自己常常忘記第一個... 這邊其實可以用 form 產生 POST 需求,並且用 css 效果包起來,達到看起來跟一般的連結一樣。

寫起來讓自己多一點印象,之後避免再犯一樣的錯誤...

KataGo 的分散式訓練計畫啟動了

KataGo 應該是目前 open source 領域裡面數一數二強的圍棋引擎,在去年就一直在開發可以讓大家參與的分散式訓練計畫,最近釋出了 v1.8.0 版,算是公開啟動了:「KataGo Distributed Training」,作者在「KataGo distributed training is open!」這邊也有大概寫一下。

基本上照著官方網站上面的說明做就可以了,可以下載 precompiled binary 或是自己編,自己編的時候注意不能直接拿 master branch 裡面編 (client hash 會不對),我自己目前是用 v1.8.0 這個版本編出來跑。

Reddit 上面的「KataGo's new run is open for public contributions!」也可以看到說明的圖片 (要注意圖上的 X 軸不是線性),算是接著本來的 g170 訓練下去,另外也標示了 ELFv2Leela Zero 大致上的強度:

目前看起來陸陸續續有人開始參與了...

另外在 CGOS 上面也可以看到 kata1 開頭的 bot 在跑,而且看起來會一直把新的 training 成果更新上去跑。

AWS 的 T4g 系列機器增加服務區域

先前在「AWS 推出了 ARM 平台上 T 系列的機器」這邊提到 Amazon EC2 推出採用 ARM 系列的 t4g.*,當時亞洲區只有東京與孟買可以使用,現在在更多區域都推出上線了:「Announcing new Amazon EC2 T4g instances powered by AWS Graviton2 processors along with a T4g free trial in Asia Pacific (Sydney, Singapore), Europe (London), North Americas (Canada Central, San Francisco), and South Americas (Sao Paulo) regions」。

抓了一下新加坡的價錢:

t4g.nano 2 N/A 0.5 GiB EBS Only $0.0053 per Hour
t3.nano 2 Variable 0.5 GiB EBS Only $0.0066 per Hour
t3a.nano 2 Variable 0.5 GiB EBS Only $0.0059 per Hour

可以來測一些東西看看如何了...