PostgreSQL 的 Fuzzy Matching

在「Fuzzy Name Matching in Postgres」這邊看到 PostgreSQL 下怎麼設計 Fuzzy Matching 的方式,文章裡用的方法主要是出自 PostgreSQL 的文件:「F.15. fuzzystrmatch」。

文章最後的解法是 Soundex + Levenshtein

翻了一下資料,這個領域另外有 NYSIIS (New York State Identification and Intelligence System):

The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System (now a part of the New York State Division of Criminal Justice Services). It features an accuracy increase of 2.7% over the traditional Soundex algorithm.

以及 Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.

不過這些都是以英文為主,中文的沒特別翻到...

各家 glob 的效能...

在「Glob Matching Can Be Simple And Fast Too」這邊看到在分析 (a.*)nb 這樣的 pattern 的效能 (像是 a.*a.*a.*b 這樣的東西),第一波先測 shell,結果發現有趣的現象:

那個 csh 是怎麼了 XDDD

Looking at the source code, it doesn’t attempt to perform glob expansion itself. Instead it calls the C library implementation glob(3), which runs in linear time, at least on this Linux system. So maybe we should look at programming language implementations too.

再來是看各程式語言:

各家實做方式不一樣 XD

然後文章裡有提到這個方式是之前蠻常見的 DoS 技巧,用很少的資源就可以吃光你的 CPU resource... 另外也提到了可能的解法,就是限制星號的數量:

At the very least, most FTP servers should probably reject glob patterns with more than, say, 3 stars.

後面演算法的部份也是很有趣的主題...