PostgreSQL 的 SERIALIZABLE 的 bug

這是 Jespen 第一次測試 PostgreSQL,就順利找出可重製的 bug 了:「PostgreSQL 12.3」。

第一個 bug 是 REPEATABLE READ 下的問題,不過因為 SQL-92 定義不夠嚴謹的關係,其實算不算是 bug 有討論的空間,這點作者 Kyle Kingsbury 在文章裡也有提出來:

Whether PostgreSQL’s repeatable-read behavior is correct therefore depends on one’s interpretation of the standard. It is surprising that a database based on snapshot isolation would reject the strict interpretation chosen by the seminal paper on SI, but on reflection, the behavior is defensible.

另外一個就比較沒問題了,是 SERIALIZABLE 下的 bug,在 SQL-92 下對 SERIALIZABLE 的定義是這樣:

The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable. A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins.

也就是說,在 SERIALIZABLE 下一堆 transaction 的執行結果,你至少可以找到一組排序,使得這些 transaction 的結果是等價的。

而 Jespen 順利找出了一組 transaction (兩個 transaction),在 SERIALIZABLE 下都成功 (但不應該成功):

對於這兩個 transaction,不論是上面這條先執行,還是下面這條先執行,都不存在等價的結果,所以不符合 SERIALIZABLE 的要求。

另外也找到一個包括三個 transaction 的情況:

把 transaction 依照執行的結果把 dependency 拉出來,就可以看出來裡面產生了 loop,代表不可能在 SERIALIZABLE 下三個都成功。

在 Jespen 找到這些 bug 後,PostgreSQL 方面也找到軟體內產生 bug 的部份,並且修正了:「Avoid update conflict out serialization anomalies.」,看起來是在 PostgreSQL 引入 Serializable Snapshot Isolation (SSI) 的時候就有這個 bug,所以 9.1 以後的版本都有這個問題...

這次順利打下來,測得很漂亮啊... 翻了一下 Jespen 上的記錄,發現好像還沒測過 MySQL,應該會是後續的目標?

Intel 的 RDRAND 爆炸...

在正妹 wens 的 Facebook 上看到的,IntelRDRAND 因為有安全漏洞 (CrossTalk/SRBDS),新推出的修正使得 RDRAND 只有原來的 3% 效能:

從危機百科上看,大概是因為這個指令集有 compliance 的要求,所以這個安全性漏洞必須在安全性上修到乾淨,所以使用了暴力鎖硬解,造成效能掉這麼多:

The random number generator is compliant with security and cryptographic standards such as NIST SP 800-90A, FIPS 140-2, and ANSI X9.82.

不過畢竟這個指令不是常常被使用,一般使用者的影響應該是還好:

As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.

另外這個漏洞早在 2018 九月的時候就通報 Intel 提了,但最後花了超過一年半時間才更新,這算是當初在提 Bug Bounty 制度時可能的缺點,在這次算是比較明顯:

We disclosed an initial PoC (Proof-Of-Concept) showing the leakage of staging buffer content in September 2018, followed by a PoC implementing cross-core RDRAND/RDSEED leakage in July 2019. Following our reports, Intel acknowledged the vulnerabilities, rewarded CrossTalk with the Intel Bug Bounty (Side Channel) Program, and attributed the disclosure to our team with no other independent finders. Intel also requested an embargo until May 2020 (later extended), due to the difficulty of implementing a fix for the cross-core vulnerabilities identified in this paper.

回到原來的 bug,主要還是 Intel 架構上的問題造成大家打得很愉快,現在 Intel 這邊的架構對於資安研究員仍然是個大家熱愛的地方... (因為用的使用者太多)

Let's Encrypt 在檢查 CAA 時出包

Let's Encrypt 發現在檢查 CAA 的程式碼有問題,發了說明:「2020.02.29 CAA Rechecking Bug」,以及預定的處理方式:「Revoking certain certificates on March 4」。

問題是當一個 certificate request 包含了 N 個 domain 時,本來的 CAA 檢查應該要對這 N 個檢查,但程式寫成只會抓一個,然後檢查了 N 次:

The bug: when a certificate request contained N domain names that needed CAA rechecking, Boulder would pick one domain name and check it N times. What this means in practice is that if a subscriber validated a domain name at time X, and the CAA records for that domain at time X allowed Let’s Encrypt issuance, that subscriber would be able to issue a certificate containing that domain name until X+30 days, even if someone later installed CAA records on that domain name that prohibit issuance by Let’s Encrypt.

2020/02/29 發現的,就程式碼的部屬時間,發現應該從去年 2019/07/25 開始就有這個 bug:

We confirmed the bug at 2020-02-29 03:08 UTC, and halted issuance at 03:10. We deployed a fix at 05:22 UTC and then re-enabled issuance.

Our preliminary investigation suggests the bug was introduced on 2019-07-25. We will conduct a more detailed investigation and provide a postmortem when it is complete.

然後決定要 revoke 這些可能會有問題的 SSL certificate,大約佔現有還有效的 SSL certificate 的 2.6%,大約三百萬筆:

Q: How many certificates are affected?
A: 2.6%. That is 3,048,289 currently-valid certificates are affected, out of ~116 million overall active Let’s Encrypt certificates. Of the affected certificates, about 1 million are duplicates of other affected certificates, in the sense of covering the same set of domain names.

在「Check whether a host's certificate needs replacement」這邊可以偵測線上使用的 SSL certificate 是否受到影響。

另外在「Download affected certificate serials for 2020.02.29 CAA Rechecking Incident」這邊可以抓到所有受到影響,預定要 revoke 的 SSL certificate 的序號。關於取得序號的方式,官方也有提供 CLI 的指令可以操作確認,對於有很多網域名稱需要確認的人可以用這組指令編寫程式判斷:

openssl s_client -connect example.com:443 -servername example.com -showcerts </dev/null 2>/dev/null | openssl x509 -text -noout | grep -A 1 Serial\ Number | tr -d :

照目前的描述,如果申請時只有一個 domain 應該是不會中這個問題,再來是最壞的情況大概會維持三個月 (網站主人沒管他,等到時間到了自動 renew)。

家裡電腦裝 Ubuntu 18.04

上個禮拜四家裡的桌機開不了機,找了一天發現是系統的 SSD 掛掉了,就買了張 M.2 SSD,然後計畫順便把本來的 Ubuntu 16.04 升級到 Ubuntu 18.04,但 Ubuntu 18.04 把預設的界面從 Unity 換成 GNOME (然後披上 Unity 的皮),加上前陣子系統從 Intel 平台換到 AMD,整個狀況變得超混亂之後,就變成一連串踩地雷的過程...

最一開始是 UEFI + LUKS 的安裝問題,本來想裝到 M.2 SSD 上面,但 Ubuntu 18.04 的 grub-install 就是硬寫到 /dev/sda 不能改:「“Unable to install GRUB in /dev/sda” when installing GRUB」,照著這篇的 workaround 用還是不行,最後放棄,直接生一顆 SATA SSD 接到 SATA Port 1,把 M.2 當作資料碟。

硬體相關的問題:

軟體相關的問題:

  • 目前不支援從 GUI 設定 PPPoE 的網路 (沃槽),幾種方式裡面我推薦用 pppoeconf 設定會比較好,然後可以改 /etc/ppp/options 加上 IPv6 的設定。
  • 本來想裝 gnome-shell-extension-system-monitor 觀察系統狀態,但會造成系統超級卡,關掉後就變成普通的卡 (後來就找到 Intel I211-AT 的那個問題了)。

現在至少是堪用的程度了,接下來就是不斷的補各種設定...

Google Chrome 對 CPU bug 的 patch

既然有方向了,後續應該會有人去找底層的問題...

先是在 Hacker News 上看到「Speculative fix to crashes from a CPU bug」這個猜測性的修正,這是因為他們發現在 IntelGemini Lake 低功耗晶片組上會發生很詭異的 crash:

For the last few months Chrome has been seeing many "impossible" crashes on Intel Gemini Lake, family 6 model 122 stepping 1 CPUs. These crashes only happen with 64-bit Chrome and only happen in the prologue of two functions. The crashes come and go across different Chrome versions.

然後依照 crash log 猜測跟 alignment 有關,所以決定用 gcc/clang 都有支援的 __attribute__ 強制設定 alignment 來避開,但看起來手上沒有可以重製的環境,所以只能先把實做丟上來...

各種對 AWS Managemenet Console 的抱怨...

Hacker News Daily 上看到 Reddit 上面有一篇對 AWS Management Console 的抱怨文,差不多是兩個月前開始累積的:「I am stupefied every day by the awfulness of the AWS web console」。

AWS 的主力開發因為是以 API 為主,而 AWS Management Console 能做的事情一直都少蠻多的 (看起來是一個團隊在開發,然後呼叫 API),而且的確是常常中 bug,所以會有這樣的抱怨其實不太意外...

然後就有人放火了:

[–]canadian_sysadmin 24 points 2 months ago
I see you've never used Azure...

[–]myron-semack 18 points 2 months ago
AWS’s console sucks because they don’t give a damn about UI. They are API-first.

Azure’s console sucks because they tried to make it nice but failed.

[–]ryantiger658 5 points 2 months ago
I was scrolling looking for this comment. Azures interface has made me appreciate AWS even more.

Azure 被偷戳了好幾下 XDDD 然後 GCP 也被偷戳了:

[–]edgan 1 point 2 months ago
It could br better, but it is far better than than Azure and GCP. Azure's old one was better than their new beta interface last I saw it. GCP has some interesting ideas, but the side bar centric design doesn't function well. It also tries to do too much, and is too JavaScript-y happy.

通常用 AWS 自己的 CloudFormation 或是第三方的 Terraform 管理還是比較常見的方式 (基於 Infrastructure as code 的概念),而 AWS Managemenet Console 當作是輔助,因為目前的雲端服務在設計上的確是希望你多用 API...

Raspberry Pi 4 的 Type C 無法使用 Macbook Charger 供電的問題

Raspberry Pi 4 出來後有些災情 (畢竟又加了不少東西近去),在 Hacker News 上看到的 Type C 介面的充電問題:「Raspberry Pi 4 not working with some chargers (scorpia.co.uk)」,引用的原文可以在「Pi4 not working with some chargers (or why you need two cc resistors)」這邊看到,裡面提到了新的 Type C 供電介面在接某些充電器時不會供電 (包括了 Macbook 的充電器):

The new pi has been released and it has a USB Type-C connector for power however people are finding some chargers are not working with it (notably macbook chargers). Some have speculated that this is due to a manufacturer limitation on the power supplies however it is actually due to the incorrect detection circuitry on the Pi end of the USB connection.

這樣說有點偏頗,但是 Macbook 的充電器一向是 Type C 裡的指標,如果這顆充電器跟其他裝置配合上有問題,通常都是代表其他裝置的實作有問題... (噗)

這次發現的電阻問題看起來有點苦 (看起來需要改版子),目前文章作者建議的 workaround 主要就是「不要用那麼好的設備」,比較簡單的包括了 Type C 的線不要那麼好 (像是找充手機用的線就好,不要找拿可以跑 5A 的線),或是透過 Type A 轉 Type C 的線也應該可以避開這個問題,最差的情況應該是找其他的充電器:

Now onto some solutions. Assuming the issue you are having is caused by the problem discussed above, using a non e-marked cable (most USB-C phone charger cables are likely this type) rather than an e-marked cable (many laptop charger/thunderbolt cables and any 5A capable cable will be in this category) will allow for the pi to be powered. In addition using older chargers with A-C cables or micro B to C adaptors will also work if they provide enough power as these don’t require CC detection to provide power. Ultimately though the best solution in the long run will be for there to be a board revision for the pi 4 which adds the 2nd CC resistor and fixes the problem.

對於已經入手的人,如果真的中獎,workaround cost 應該還在可以控制的範圍...

歐盟對十四套 Open Source 軟體推出 Bug Bounty Program

歐盟對於 14 套 open source 軟體推出 bug bounty program,協助改善這些軟體的品質 (主要是資安這塊):「EU to fund bug bounty programs for 14 open source projects starting January 2019」、「In January, the EU starts running Bug Bounties on Free and Open Source Software」。

這十四套軟體的選擇應該可以參考「EU aims to increase the security of password manager and web server software: KeePass and Apache chosen for open source audits」這邊...

然後看到「Intigriti/Deloitte」這個才知道原來 Deloitte 也有做這個啊...

Ubuntu 撥 HiNet PPPoE 時會因為 MTU 而導致有些網站連不上

之前用 HiNet 固定制 (不需要 PPPoE,直接設 IP 就會通的那種),跑起來順順的也沒麼問題,最近剛好合約滿了就打算換成非固定制 (需要撥 PPPoE 才會通),結果換完後發現有些網站常常連不上 (不是一直都連不上),但只要設了 proxy.hinet.net (今年年底要停止服務了) 或是改從 cable 線路出去就正常。

測了不少設定都沒用 (像是改 tcp timestamp 設定,或是 sack 之類的設定),後來發現 MTU 的值不太對,用 ifconfig 看發現我的 ppp01500 而不是 1492,直接先 ifconfig ppp0 mtu 1492 改下去測,發現本來不能連的網站就通了...

(補充一下,我看了 Windows 的設定是 1480,所以也沒問題,但不知道怎麼算的...)

查了一下 MTU 相關的問題,發現在「wrong mtu value on dsl connection」這邊有討論到。裡面提到的 workaround 是到 /etc/NetworkManager/system-connections/ 裡找出你的 PPPoE 設定檔,然後在 ppp 區域的裡面寫死 mtu 參數:「mtu=1492」(這邊的 1492 是從 1500 bytes 扣掉 PPPoE 的 8 bytes 得出來的),不過我測試發現在修改設定檔時會被改回來,加上測試發現沒用,只好自己寫一個 /etc/network/if-up.d/pppoe-mtu 惡搞了:

#!/bin/sh -e

if [ "$IFACE" != "ppp0" ]; then
        exit 0
fi

/sbin/ifconfig ppp0 mtu 1492

放進去後要記得 chmod 755

從 ticket 上面看起來還是沒有解 (2009 年就發現了),看起來 PPPoE 不是絕對多數而且又有 workaround,短期應該不會修正...

CVE-2018-14665:setuid 複寫檔案的 security issue...

Twitter 上看到的 security issue,好久沒在這麼普及的軟體上看到這種 bug 了:

CVE - CVE-2018-14665 的說明裡面有提到 1.20.3 前的版本都有中,但沒講到從哪個版本開始,看起來是全系列...?

A flaw was found in xorg-x11-server before 1.20.3. An incorrect permission check for -modulepath and -logfile options when starting Xorg. X server allows unprivileged users with the ability to log in to the system via physical console to escalate their privileges and run arbitrary code under root privileges.

這一臉 orz...