## Golang 的排序演算法將換成 pdqsort，LLVM libc++ 換成 BlockQuicksort

Hacker News 首頁上看到的消息，Golang 將會把 sort.Sort() 換成 pdqsort (Pattern-defeating Quicksort)：「Go will use pdqsort in next release (github.com/golang)」，對應的 commit 則是在「sort: use pdqsort」這邊可以看到。

It would be nice if only one or two of the sorting methods would dominate all of the others, regardless of application or the computer being used. But in fact, each method has its own peculiar virtues. […] Thus we find that nearly all of the algorithms deserve to be remembered, since there are some applications in which they turn out to be best.

Pattern-defeating quicksort (pdqsort) is a novel sorting algorithm that combines the fast average case of randomized quicksort with the fast worst case of heapsort, while achieving linear time on inputs with certain patterns.

Golang 選擇把 unstable 的 Quicksort 換成 pdqsort，LLVM 則是選擇把 Quicksort 換成 BlockQuicksort，這邊看起來有些分歧...

## 檔案壓縮順序造成壓縮率的差異

Hacker News Daily 上看到「Why are tar.xz files 15x smaller when using Python's tar library compared to macOS tar?」這篇，作者問了為什麼他用 Pythontarfile 壓出來比起用 tar 壓出來小了 15 倍，檔案都是 JSON 檔壓成 XZ 格式：

I'm compressing ~1.3 GB folders each filled with 1440 JSON files and find that there's a 15-fold difference between using the tar command on macOS or Raspbian 10 (Buster) and using Python's built-in tarfile library.

Ok, I think I found the issue: BSD tar and GNU tar without any sort options put the files in the archive in an undefined order.

After installing GNU tar on my Mac with:

`brew install gnu-tar`

And then tarring the same folder, but with the --sort option:

`gtar --sort='name' -cJf zsh-archive-sorted.tar.xz /Users/user/Desktop/temp/tar/2021-03-11`

I get a .tar.xz archive of 1.5 MB, equal to the archive created by the Python library.

My JSON files contain measurements from hundreds of sensors. Every minute I read out all sensors, but only a few of these sensors have a different value from minute to minute.

By sorting the files by name (which has the creation unixtime at the beginning of it), two subsequent files have very little different characters between them. Apparently this is very favourable for the compression efficiency.

## Amazon Redshift 會自動在背景重新排序資料以增加效能

Amazon Redshift 的新功能，會自動在背景重新排序資料以增加效能：「Amazon Redshift introduces Automatic Table Sort, an automated alternative to Vacuum Sort」。

This feature is available in Redshift 1.0.11118 and later.

Automatic table sort is now enabled by default on Redshift tables where a sort key is specified.

Redshift runs the sorting in the background and re-organizes the data in tables to maintain sort order and provide optimal performance. This operation does not interrupt query processing and reduces the compute resources required by operating only on frequently accessed blocks of data. It prioritizes which blocks of table to sort by analyzing query patterns using machine learning.

## 引用自己論文的問題...

Nature 上點出來期刊論文裡自我引用的問題 (這邊的自我引用包括了合作過的人)：「Hundreds of extreme self-citing scientists revealed in new database」。

Vaidyanathan, a computer scientist at the Vel Tech R&D Institute of Technology, a privately run institute, is an extreme example: he has received 94% of his citations from himself or his co-authors up to 2017, according to a study in PLoS Biology this month. He is not alone. The data set, which lists around 100,000 researchers, shows that at least 250 scientists have amassed more than 50% of their citations from themselves or their co-authors, while the median self-citation rate is 12.7%.

## JavaScript 的 sort 變成 stable

Firefox 則是很久前就決定使用 Merge sort 了 (看了一下，當時還在從 Firebird 轉換名稱到 Firefox 的時期)：「Array.sort isn't a stable sort (switch to MergeSort)」。

Timsort 是 1993 年發明出來的演算法，與 Merge sort 的情況類似，除了 stable 外，還可以保證最差的情境下的時間複雜度是 $O(n*log(n))$

Timsort is a hybrid stable sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data.

The algorithm finds subsequences of the data that are already ordered, and uses that knowledge to sort the remainder more efficiently.

Timsort has been Python's standard sorting algorithm since version 2.3. It is also used to sort arrays of non-primitive type in Java SE 7, on the Android platform, in GNU Octave, and Google Chrome.

## 對於按讚數排名的方法

Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter

```require 'statistics2'

def ci_lower_bound(pos, n, confidence)
if n == 0
return 0
end
z = Statistics2.pnormaldist(1-(1-confidence)/2)
phat = 1.0*pos/n
(phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end```

The z-score in this function never changes, so if you don't have a statistics package handy or if performance is an issue you can always hard-code a value here for z. (Use 1.96 for a confidence level of 0.95.)

```SELECT widget_id, ((positive + 1.9208) / (positive + negative) -
1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
(positive + negative)) / (1 + 3.8416 / (positive + negative))
AS ci_lower_bound FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;```
```=IFERROR((([@[Up Votes]] + 1.9208) / ([@[Up Votes]] + [@[Down Votes]]) - 1.96 *
SQRT(([@[Up Votes]] *  [@[Down Votes]]) / ([@[Up Votes]] +  [@[Down Votes]]) + 0.9604) /
([@[Up Votes]] +  [@[Down Votes]])) / (1 + 3.8416 / ([@[Up Votes]] +  [@[Down Votes]])),0)```

## Go 1.6 把 HTTP/2 變成預設支援的功能

Go 的官方公告「Go 1.6 is released」提到了把 net/http 的 HTTP/2 預設啟用了：

In Go 1.6, support for HTTP/2 is enabled by default for both servers and clients when using HTTPS, bringing the benefits of the new protocol to a wide range of Go projects, such as the popular Caddy web server.

The algorithm inside sort.Sort was improved to run about 10% faster, but the change may break programs that expect a specific ordering of equal but distinguishable elements.