PostgreSQL 的 Fuzzy Matching

在「Fuzzy Name Matching in Postgres」這邊看到 PostgreSQL 下怎麼設計 Fuzzy Matching 的方式,文章裡用的方法主要是出自 PostgreSQL 的文件:「F.15. fuzzystrmatch」。

文章最後的解法是 Soundex + Levenshtein

翻了一下資料,這個領域另外有 NYSIIS (New York State Identification and Intelligence System):

The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System (now a part of the New York State Division of Criminal Justice Services). It features an accuracy increase of 2.7% over the traditional Soundex algorithm.

以及 Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.

不過這些都是以英文為主,中文的沒特別翻到...

AWS WAF 支援 Regex (PCRE)

首先是 AWS WAF 支援 Regex 了:「AWS WAF Now Supports Regular Expressions (Regex)」。

而且是以 PCRE 版本為主:

AWS WAF supports most of the standard Perl Compatible Regular Expressions (PCRE).

這樣設定變得方便很多啊,大家都算熟 regex,而且也夠強大...

另外一個公告是 AWS WAF 可以將地區的當條件進行設定了:「AWS WAF Now Supports Geographic Match」。

除了針對某些地區擋掉或是開放以外,也可以針對不同地區設定 rate limit。當條件設定就是了...

各家 glob 的效能...

在「Glob Matching Can Be Simple And Fast Too」這邊看到在分析 (a.*)nb 這樣的 pattern 的效能 (像是 a.*a.*a.*b 這樣的東西),第一波先測 shell,結果發現有趣的現象:

那個 csh 是怎麼了 XDDD

Looking at the source code, it doesn’t attempt to perform glob expansion itself. Instead it calls the C library implementation glob(3), which runs in linear time, at least on this Linux system. So maybe we should look at programming language implementations too.

再來是看各程式語言:

各家實做方式不一樣 XD

然後文章裡有提到這個方式是之前蠻常見的 DoS 技巧,用很少的資源就可以吃光你的 CPU resource... 另外也提到了可能的解法,就是限制星號的數量:

At the very least, most FTP servers should probably reject glob patterns with more than, say, 3 stars.

後面演算法的部份也是很有趣的主題...

GitHub 提供更輕量的 Commit Reference SHA-1 API

GitHub 提供了新的 API 讓 client 可以更省網路資源,同時 GitHub 本身也可以省下 query。雖然是 Preview 期間,但已經有專案開始用了:「Commit Reference SHA-1 Preview Period」。

本來是這樣抓:

curl "https://api.github.com/repos/Homebrew/homebrew/commits/master" \
  -H "Accept: application/vnd.github.chitauri-preview+sha"

現在則可以加上 If-None-Match

curl "https://api.github.com/repos/Homebrew/homebrew/commits/master" \
  -H "Accept: application/vnd.github.chitauri-preview+sha" \
  -H "If-None-Match: \"814412cfbd631109df337e16c807207e78c0d24e\""

當本地與遠端的 SHA-1 值一樣時會收到 304,而且不會吃 rate limit quota:

If the remote and your local branch point to the same SHA-1 then this call will return a 304 Unmodified status code (and not use your rate limit).

尤其是當 commit reference 指到的 commit 特別大包時,可以省下很多資源...