calloc() 與 malloc() 的差異

前陣子在 Hacker News Daily 上看到的,原文是 2016 的文章:「Why does calloc exist?」,裡面講的東西包括了 implementation dependent 的項目,所以要注意一下他的結論未必適用於所有的平台與情境。

malloc()calloc() 的用法是這樣,其中 calloc() 會申請 countsize 的空間:

void* buffer1 = malloc(size);
void* buffer2 = calloc(count, size);

第一個差異是,count * size 可能會 overflow (而 integer overflow 在 C 裡面是 undefined behavior),這點除非你在乘法時有檢查,不然大多數的行為都還是會生一個值出來。

calloc() 則是會幫你檢查,如果會發生 overflow 的時候就不會真的去要一塊記憶體用。

第二個差異是 calloc() 保證會將內容都設定為 0,這點在 POSIX 的標準裡面是這樣寫的:

The calloc() function shall allocate unused space for an array of nelem elements each of whose size in bytes is elsize. The space shall be initialized to all bits 0.

但作者就發現 malloc() + memset() + free() 還是比 calloc() + free() 慢很多:

~$ gcc calloc-1GiB-demo.c -o calloc-1GiB-demo
~$ ./calloc-1GiB-demo
calloc+free 1 GiB: 3.44 ms
malloc+memset+free 1 GiB: 365.00 ms

研究發現是 calloc() 用了 copy-on-write 的技巧,先把所有的 page 都指到同一塊完全被塞 0 的記憶體,只有在真的寫到該段記憶體時,系統才會要一塊空間來用:

Instead, it fakes it, using virtual memory: it takes a single 4 KiB page of memory that is already full of zeros (which it keeps around for just this purpose), and maps 1 GiB / 4 KiB = 262144 copy-on-write copies of it into our process's address space. So the first time we actually write to each of those 262144 pages, then at that point the kernel has to go and find a real page of RAM, write zeros to it, and then quickly swap it in place of the "virtual" page that was there before. But this happens lazily, on a page-by-page basis.

但畢竟這是 implementation dependent,看看有個印象就好。

C 語言裡面的 ??! 符號

Hacker News Daily 上看到這個奇怪的知識:「What does the ??!??! operator do in C? (stackoverflow.com)」,原文在 Stack Overflow 上:「What does the ??!??! operator do in C?」。

這是 trigraph,在 C89 就有了,從 Rationale for International Standard—Programming Languages—C 這邊的 5.2.1.1 可以看到 trigraph 的歷史原因:

Trigraph sequences were introduced in C89 as alternate spellings of some characters to allow the implementation of C in character sets which do not provide a sufficient number of non-alphabetic graphics

而且是強制要求實做:

Implementations are required to support these alternate spellings, even if the character set in use is ASCII, in order to allow transportation of code from systems which must use the trigraphs. AMD1 also added digraphs (see §6.4.6 and §MSE.4).

其中遇到的問題就是當年得決定 C 可以用的 charset,得考慮到很多不同機器 charset 相容性的問題:

The C89 Committee faced a serious problem in trying to define a character set for C. Not all of the character sets in general use have the right number of characters, nor do they support the graphical symbols that C users expect to see. For instance, many character sets for languages other than English resemble ASCII except that codes used for graphic characters in ASCII are instead used for alphabetic characters or diacritical marks. C relies upon a richer set of graphic characters than most other programming languages, so the representation of programs in character sets other than ASCII is a greater problem than for most other programming languages.

然後就使用了 ISO/IEC 646 這個標準 (要記得 Unicode 1.0.0 是 1991 年才出現):

The solution is an internationally agreed-upon repertoire in terms of which an international representation of C can be defined. ISO has defined such a standard, ISO/IEC 646, which describes an invariant subset of ASCII.

The characters in the ASCII repertoire used by C and absent from the ISO/IEC 646 invariant repertoire are:

[ ] { } \ | ~ ^

後面就是定義 ?? 當作 escape digraph。

算是一個歷史產物,現在不太需要用到了...

TCP 標準被整理到 RFC 9293

看到「RFC 9293: Transmission Control Protocol (TCP)」這篇,主要是把本來分散在各個 RFC 的文件 (從 RFC 793 開始) 全部整理成一份,另外把一些已知的勘誤表放進來:

This document specifies the Transmission Control Protocol (TCP). TCP is an important transport-layer protocol in the Internet protocol stack, and it has continuously evolved over decades of use and growth of the Internet. Over this time, a number of changes have been made to TCP as it was specified in RFC 793, though these have only been documented in a piecemeal fashion. This document collects and brings those changes together with the protocol specification from RFC 793. This document obsoletes RFC 793, as well as RFCs 879, 2873, 6093, 6429, 6528, and 6691 that updated parts of RFC 793. It updates RFCs 1011 and 1122, and it should be considered as a replacement for the portions of those documents dealing with TCP requirements. It also updates RFC 5961 by adding a small clarification in reset handling while in the SYN-RECEIVED state. The TCP header control bits from RFC 793 have also been updated based on RFC 3168.

然後淘汰掉 (obselete) 一卡車 RFC 文件 XD

翻資料發現 2014 的時候 HTTP/1.1 被幹過一次類似的事情,不過是反過來被拆開:「HTTP/1.1 的更新」,這次把 RFC 2616 幹掉分成 RFC 7230RFC 7235

然後今年因為 HTTP 的關係又被幹了一次,這次 HTTP/1.1 又被整回來變成一份文件,但是把裡面的一些概念拆開:「HTTP 標準的翻新」。

其中 RFC 9110 定義 HTTP Semantics,RFC 9111 定義 HTTP Caching,然後 RFC 9112RFC 9113 拿來定義了 HTTP/1.1 與 HTTP/2,另外先把 HTTP/3 的號碼保留下來的 RFC 9114

不斷 refactor 以及加新功能的文件...

Mozilla 把對各種規格的態度整理成一個網頁

查資料的時候看到「Mozilla Specification Positions」這個,可以看到 Mozilla 對各種規格的態度,分成幾個不同的類型。

有「under consideration」、「important」、「worth prototyping」這幾個從名字上就大概猜的出來意思,接下來的幾個比較有趣的是「non-harmful」、「defer」與「harmful」。

官方的說明是:

  • under consideration: Mozilla's position on this specification is being discussed.
  • important: This specification is conceptually good and is important to Mozilla.
  • worth prototyping: Mozilla sees this specification as conceptually good, and worth prototyping, getting feedback on its value, and iterating.
  • non-harmful: Mozilla does not see this specification as harmful, but is not convinced that it is a good approach or worth working on.
  • defer: Mozilla does not intend to look at this specification at all in the near future.
  • harmful: Mozilla considers this specification to be harmful in its current state.

所以之後看到一些標準時可以過來這邊看看 Mozilla 的態度...

可以在 Cat5 上面跑 1km 的 Ethernet 標準 10BASE-T1L

Hacker News 上看到「SPEBlox-Long (10BASE-T1L) - 10Mbps, 1km range」這個產品,看到 10BASE-T1L 這個標準還有蠻有趣的,對應的討論在「10mbps over 1km on a single pair of wires (botblox.io)」這邊。

在維基百科的「Ethernet over twisted pair」這個頁面上面有提到 10BASE-T1S 與 10BASE-T1L 這兩個在 2019 推出的新標準:

Two new variants of 10 megabit per second Ethernet over a single twisted pair, known as 10BASE-T1S and 10BASE-T1L, were standardized in IEEE Std 802.3cg-2019. 10BASE-T1S has its origins in the automotive industry and may be useful in other short-distance applications where substantial electrical noise is present. 10BASE-T1L is a long-distance Ethernet, supporting connections up to 1 km in length. Both of these standards are finding applications implementing the Internet of things.

從標準的名字就可以知道是 10Mbps 的速度,但只用一對線路就可以跑 1km 還蠻有趣的,主打在 IoT 場景...

HTTP 標準的翻新

HTTP 的標準之前都是用新的 RFC 補充與修正舊的標準,所以整體讀起來會比較累,對於開始了解 HTTP 的人會需要交叉讀才能理解。

而這次 RFC 9110~9114 算是一次性的把文件全部重新整理出來,可以看到蠻多人 (以及團體) 都有丟出來對應的看法,這邊丟這兩篇:「A New Definition of HTTP」與「HTTP RFCs have evolved: A Cloudflare view of HTTP usage trends」。

而這五個 RFC,從名稱列出來就可以看出來命名簡單粗暴,把核心概念先拆出來講,然後再講不同 protocol 的部份:

Cloudflare 這邊提供了一些資料,可以看到三個 protocol 使用率都算高,而目前最高的是 HTTP/2:

另外比較特別的是 Safari 在 HTTP/3 的趨勢居然有倒縮的情況:

然後 bot 的部份幾乎大家都支援 HTTP/2 了,目前還沒看到太多 HTTP/3 的蹤跡,倒是 LinkedIn 的 bot 有個奇怪的 adoption 然後全部 rollback 的情況,而最近又開始少量導入了:

這次看起來淘汰了 (obsolete) 很多之前的文件,以後要引用得往這五份來引...

在資料裡面經緯度的擺放順序:先經度再緯度,或是先緯度再經度?

Hacker News Daily 上看到「lon lat lon lat」這篇蠻有趣的 (然後作者也很無奈),整理出來目前各種 spec 內怎麼擺放經緯度的方式...

重點在這張:

台灣因為中文裡會講「經緯度」,習慣就是先經後緯,也就是 (lon, lat) 的形式,作者的偏好也是一樣,但世界上還是有其他地區的習慣是反過來的 XD

Hacker News 裡的討論「Lon Lat Lon Lat (macwright.com)」可以看出來國際標準內也是都有,所以大概不會有標準答案...

身為工程師,把這個現象當作某種容易中獎的 pitfall,遇到這類資料的時候要有習慣去確認 spec 定義的順序就可以了...

Python 的 Black

Hacker News 上看到 Black 這個幫你處理 Python 程式碼的工具:「Black, the uncompromising Python code formatter, is stable (pypi.org)」。

Black is the uncompromising Python code formatter. By using it, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting. You will save time and mental energy for more important matters.

然後從 Hacker News 上討論的情況看起來大家都覺得很不錯?好像可以看看能不能拿來用...

另外一個在討論的時候看到學到的東西,是 git blame --ignore-revs-file 這個功能,可以在 git blame 時濾掉某些 commit,剛好拿來過濾 reformatting commit:

Ignore revisions listed in file, which must be in the same format as an fsck.skipList. This option may be repeated, and these files will be processed after any files specified with the blame.ignoreRevsFile config option. An empty file name, "", will clear the list of revs from previously processed files.

RFC 3339 與 ISO 8601 用範例的比較

Hacker News 首頁上看到「RFC 3339 vs. ISO 8601 (ijmacd.github.io)」這個連結,原始網站是「RFC 3339 vs ISO 8601」。

兩個都是試著定義日期與時間的表達方式,其中 ISO 8601 是在 1988 年定義出來的,被廣泛應用在很多 spec 裡,但最大的問題就是 ISO 8601 沒辦法公開免費取得,這點在前幾個月也在 Hacker News 上被討論過:「ISO-8601 date format reference not publicly available (twitter.com/isostandards)」。

RFC 3339 是在 2002 年訂出來,其中一個目標就是試著解決 ISO 8601 不公開的問題 (沒講明就是了):

This document includes an Internet profile of the ISO 8601 [ISO8601] standard for representation of dates and times using the Gregorian calendar.

這份 RFC 3339 與 ISO 8601 的比較可以讓設計 spec 的人可以看一下,在大多數的情況下,RFC 3339 應該是夠用的...

Amazon SES 總算支援 2048 bits RSA key 了

Amazon SES 總算是支援 2048 bits RSA key 了:「Amazon SES now supports 2048-bit DKIM keys」。

然後講一些幹話... 隔壁微軟早在 2019 年就支援 2048 bits RSA key 了:

Until now, Amazon SES supported a DKIM key length of 1024-bit, which is the current industry standard.

另外用 ECC 演算法的一直都沒進 standard,像是已經先 book 了 RFC 8463 位置的 Ed25519,在 draft 狀態放好久了:「A New Cryptographic Signature Method for DomainKeys Identified Mail (DKIM)」,還有用 ECDSA 的「Defining Elliptic Curve Cryptography Algorithms for use with DKIM」也是放著,不知道是卡到什麼東西,可能是專利?