JavaScript 上的 fuzzy search library

Hacker News Daily 上看到 Show HN (作者自己或是主要的 contributor 上來發表的作品) 給了一個號稱速度很快,吃資源很少的 fuzzy search library:「Show HN: uFuzzy.js – A tiny, efficient fuzzy search that doesn't suck (github.com/leeoniya)」。

這種已經發展許久,但突然有一天有人說他的東西超好超棒棒的,除非是有新的基礎演算法突破,不然馬上就會想到很經典的「Three circles model」,中間的那些區塊就懶的畫上去了:

依照他的「測試」,可以看到他宣稱完全領先的狀態:

但回過頭來看評論:

Thank you for this!

I am also quite frustrated with the current state of full text search in the javascript world. All libs I've tried miss the most basic examples and their community seems to ignore it. Will give yours a try but it already looks much better from the comparison page.

Edit: Nope, your lib doesn't seem to handle substitution well (THE most common type of typo), so yep, we are back to square one ...

From fuzzy search I expected that entering "super meet boy" or "super maet boy" will return "Super Meat Boy" but unfortunately currently it doesn't work this way and it's quite disappointing.

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...

看起來這個 library 沒有辦法解決 fuzzy search 最常見的 case (小 typo),依照範例描述的更像是 substring 搜尋加上一些額外的的功能,反而比較像是 auto completion library,或是講的比較廣一點,可以算是 auto suggestion library。

不過我覺得真正的重點 (對我來說的重點) 是下面的比較表格,因為列出了目前市場上的方案,這份清單之後可以拿來參考...

Kagi 公佈了收費三個月後的進展

Kagi 公佈了收費三個月後的進展 (可以參考「Kagi 開始收費了」這篇):「Kagi status update: First three months」。

搜尋的部份 (Kagi 這個產品線),目前有 2600 個付費使用者,以 US$10/mo 的費用來算大概是 US$26K/mo 的收入:

Kagi search is currently serving ~2,600 paid customers. We have seen steady growth since the launch 3 months ago. Note, this is with zero marketing and fully relying on word of mouth. We prefer to keep things this way for now, as we are still developing the product towards our vision of a user-centric web search experience.

後面在講財務狀況也是類似的數字 (幾乎都是 Kagi 的付費收入):

Between Kagi and Orion, we are currently generating around $26,500 USD in monthly recurring revenue, which incidentally about exactly covers our current API and infrastructure costs.

這個收入差不多 cover 目前的 infrastructure 部份,但還有薪資與其他的 operating cost 大約在 US$100K/mo 這個數量級,看起來還有很大的距離:

Between Kagi and Orion, we are currently generating around $26,500 USD in monthly recurring revenue, which incidentally about exactly covers our current API and infrastructure costs.

That means that salaries and all other operating costs (order of magnitude of $100K USD/month) remain a challenge and are still paid out of the founders’ pocket (Kagi remains completely bootstrapped).

然後要大概是目前十倍的付費數量才會打平 (25K 個使用者):

We are planning to reach sustainability at around 25,000 users mark, by further improving the product, introducing new offerings and pricing changes. With the product metrics being as good as they are, we should be able to reach this as our visibility increases.

比較好一點的消息是 churn rate 很低:

Product stickiness is also very high, with churn being lower than 3%.

然後提到每個使用者大約 27 次查詢 (包括 free tier),有些 user 大約在 100 次,peak 是 400 次:

We are currently serving around 70,000 queries a day or around ~27 queries/day/user (this includes free users which are about 10% of total users). There is a lot of variance in use though, with some users regularly searching >100 times a day. Every time we see a search count go >400 times in day we are happy to be an important part of someone’s search experience.

我看了一下自己的用量,看起來偏高一些,但沒到他說的每天平均 100 次:

然後提到了推出新方案的計畫,包括 Teams Plan & Family Plan,而目前在跑的方案會被分類到 Individual Plans。

另外比較重要的是 Individual Plans 有漲價的計畫。新的方案預定分成三個層級,主要是增加了一個 Kagi Starter 的版本:

  • Kagi Unlimited - $19/mo or $180/year ($15/mo) or $288/biennial ($12/mo) - Original Kagi experience, with unlimited searches
  • Kagi Starter ($5/mo; 200 searches) - For casual users who make less than 200 searches per month
  • Free basic - 50 free searches that reset every month

漲不少,雖然有提到在漲價前既有的付費使用者將會維持原價:

If such change to Individual plans is to occur, we plan to grandfather-in all early adopters (meaning all current and future paid customers, up until this change) allowing them to keep their existing subscription price as long as they don’t cancel it.

繼續觀察看看...

滲透測試的工具,各種搜尋引擎

Twitter 上看到的東西:

裡面是一張圖,整理一下這 24 個站台:

一堆 .io 網域...

裡面有蠻多服務是偶而會用到的,改拿來當作 pen test 的基礎工作也是蠻好用的,各種預先掃好的結果拿來搜...

Google 說要把 double quote 強制搜尋的功能加回來...

Hacker News Daily 上看到「We're improving search results when you use quotes (blog.google)」這則,才知道原來被拔掉了?(不過已經很久不是拿 Google Search 當主力了...)

原文在「How we're improving search results when you use quotes」這邊,裡面提到:

For example, if you did a search such as [“google search”], the snippet will show where that exact phrase appears:

[...]

In the past, we didn’t always do this because sometimes the quoted material appears in areas of a document that don’t lend themselves to creating helpful snippets.

在「Google for the exact phrase (and no, quotation marks don't help)」這邊可以看到 2020 的時候 double quote 就已經不是傳回精確的結果了。

不過應該不會回去用 Google Search 了,一方面是 Kagi 的表現還不錯,另外一方面是避免讓 Google 拿到更多資訊...

微軟的 Outlook 系統會自動點擊信件內的連結

前幾天在 Hacker News Daily 上翻到的,微軟的 Outlook 系統 (雲端上的系統) 會自動點擊信件內的連結,導致一堆問題:「“Magic links” can end up in Bing search results — rendering them useless.」,在 Hacker News 上的討論也有很多受害者出來抱怨:「“Magic links” can end up in Bing search results, rendering them useless (medium.com/ryanbadger)」。

原文的標題寫的更批評,指控 Outlook 會把這些 link 丟到 Bing 裡面 index,這點還沒有看到確切的證據。

先回到連結被點擊的問題,照文章內引用的資料來看,看起來是 2017 年開始就有的情況:「Do any common email clients pre-fetch links rather than images?」。

As of Feb 2017 Outlook (https://outlook.live.com/) scans emails arriving in your inbox and it sends all found URLs to Bing, to be indexed by Bing crawler.

在 Hacker News 上的討論也提到了像是 one-time login email 的機制也會因此受到影響,被迫要用比較費工夫的方法讓使用者登入 (像是給使用者 one-time code 輸入,而不是點 link 就可以登入)。

先記起來,以後在設計時應該會遇到,要重新思考 threat model...

搜尋引擎的替代方案清單

看到「A look at search engines with their own indexes」這篇在介紹各個搜尋引擎,作者設計了一套方法測試,另外在文章裡面也給了很多主觀的意見,算是很有參考價值的,可以試看看裡面提出來的建議。

另外在 Hacker News 上也有討論可以參考:「A look at search engines with their own indexes (2021) (seirdy.one)」。

在文章開頭的「General indexing search-engines」這個章節,先列出三大搜尋引擎 GBY (GoogleBingYandex),以及使用這三家當作後端資料庫的搜尋引擎,可以看到到處都是 Bing 的影子。

接著作者推薦 Mojeek 這個作為 GBY 的替代方案:

Mojeek: Seems privacy-oriented with a large index containing billions of pages. Quality isn’t at GBY’s level, but it’s not bad either. If I had to use Mojeek as my default general search engine, I’d live. Partially powers eTools.ch. At this moment, I think that Mojeek is the best alternative to GBY for general search.

在「Smaller indexes or less relevant results」這邊也有一些方案,像是這個章節第一個提到的 Right Dao,作者就給他了不錯的評價:

Right Dao: very fast, good results. Passes the tests fairly well. It plans on including query-based ads if/when its user base grows.

接下來的「Smaller indexes, hit-and-miss」與「Unusable engines, irrelevant results」也可以翻一下,看看作者怎麼批評 XD

然後是後面的「Semi-independent indexes」就出現了最近幾個比較有名的,像是 Brave Search 與目前我在用的 Kagi

整理的相當不錯...

Kagi 開始收費了

在「Kagi search and Orion browser enter public beta」這邊可以看到 public beta 與收費的消息:

We decided to start charging for Kagi search while in beta status because the cost of beta usage has gone up dramatically and we are not able to sustain it. Also, we want to get some kind of a financial “airworthiness” signal and see how we stand with our positioning and outlook for sustainability.

收費的費用是 US$10/mo,剛剛先把信用卡掛上去了...

Kagi 的搜尋引擎開放註冊,以及公佈付費方案

先前提過 Kagi 這個搜尋引擎 (「來測試 Kagi 這個搜尋引擎」與「用兩個禮拜 Kagi 的心得」),剛剛翻信箱時看到他們的信件,看起來現在任何人都可以註冊了:

If by any chance you do not have a Kagi account yet, you can make one at https://kagi.com/signup?invite_code=humaneweb (feel free to share with your friends)

另外有提到目前規劃的收費計畫是 US$10/mo:

Kagi will come as a free version with limited use; and an unlimited use, paid option at $10 a month, both versions having great search results with less spam and completely ad-free, tracking free, and with none of your search data being retained.

Kagi 已經是我目前預設的 search engine,而且品質其實相當滿意 (偶而會切到 DuckDuckGo 以及 Google 比較),之後就等付費機制上線...

公平會對創業家兄弟與松果公司的 SEO 誘導轉向開罰

好像很少提到國內的新聞,但這則應該是這兩天蠻熱門的一個新聞,創業家兄弟與松果公司 (也是創業家兄弟公司) 被公平會開罰:「操作SEO搜尋關鍵字誤導消費者 創業家兄弟、松果公司挨罰」,相關的備份先留起來:Internet Archivearchive.today

公平會官方的新聞稿則可以在「利用程式設計引誘消費者「逛錯街」,公平會開罰」這邊看到,對應的網頁備份:Internet Archivearchive.today

用的是公平交易法第 25 條:

公平會於4月12日第1594次委員會議通過,創業家兄弟股份有限公司及松果購物股份有限公司利用「搜尋引擎優化 (Search Engine Optimization,簡稱SEO)」技術,並在搜尋引擎的顯示結果上不當顯示特定品牌名稱,使消費者誤認該賣場有販售特定品牌產品,藉以增進自身網站到訪率,違反公平交易法第25條規定,處創業家兄弟公司200萬元、松果公司80萬元罰鍰。

這條的條文可以從「公平交易法§25-全國法規資料庫」這邊看到:

除本法另有規定者外,事業亦不得為其他足以影響交易秩序之欺罔或顯失公平之行為。

主要的原因是點進去後卻沒有該項商品:

公平會發現,消費者在Google搜尋引擎打上特定品牌名稱,例如「悅夢床墊」時,搜尋結果會出現「悅夢床墊的熱銷搜尋結果│生活市集」、「人氣熱銷悅夢床墊口碑推薦品牌整理─松果購物」等搜尋結果,消費者被前述搜尋結果吸引點選進入「生活市集」、「松果購物」網站後,卻發現該賣場並無「悅夢床墊」之產品,此係生活市集及松果購物之經營者創業家兄弟公司及松果公司分別利用SEO技術所產生的現象。

而且會透過使用者在往站上搜尋的關鍵字產生對應的頁面:

公平會進一步調查後發現,創業家兄弟公司及松果公司對其所經營之「生活市集」及「松果購物」網頁進行設計,只要網路使用者在該2網站搜尋過「悅夢床墊」,縱然該2網站賣場並沒有賣「悅夢床墊」,其網站程式也會主動生成行銷文案網頁,以供搜尋引擎攫取。若有消費者之後在Google搜尋引擎查詢「悅夢床墊」時,搜尋結果便會帶出「悅夢床墊的熱銷搜尋結果│生活市集」、「人氣熱銷悅夢床墊口碑推薦品牌整理─松果購物」等搜尋結果項目,經消費者點選後即會導向「生活市集」、「松果購物」之網站。

然後判罰的部份:

公平會過往即曾就事業使用競爭對手事業名稱作為關鍵字廣告,並在關鍵字廣告併列競爭對手事業名稱之行為,認定違反公平交易法第25條規定。本案雖非創業家兄弟公司及松果公司直接使用「悅夢床墊」等他人商品品牌作為關鍵字廣告,但最終呈現之結果,本質上都是「誘導/轉向」(bait-and-switch)的欺罔行為,除了打斷消費者正常的商品搜尋與購買過程,也對其他販售該等品牌商品之經營者形成不公平競爭的效果。若任由發生而不予規範,未來將可能導致其他競爭者之競相仿效,消費者將更難以分辨搜尋結果呈現資訊之真偽,進而威脅電商市場之競爭秩序及消費者利益。故公平會認為違反公平交易法第25條「足以影響交易秩序之欺罔及顯失公平行為」,並分別處創業家兄弟公司200萬元、松果公司80萬元罰鍰。

所以這算是對 Dark pattern SEO 的部份開罰...

又有 Blog Search Engine 了:Blog Surf

在「Show HN: Search Engine for Blogs (blogsurf.io)」這邊看到又有 blog search engine 了,叫做 Blog Surf

比較有趣的應該是留言裡面看到這個,已經掛掉的先人出來說,以前這個使用族群都是在打手槍的族群 XDDD

mgarfias 12 hours ago

We, sphere.com, did this starting in 2006. After a year or so, we realized the only people using the service were looking to stroke their egos.

Ice rocket, and something else (I can’t remember the name) tried it at the same time and failed.

We pivoted, which ended up leading to some unspeakable horrors.

At any rate, good luck, hope it works better for you.

回到 Blog Surf 來看,在 About 頁上提到了 MarketRank,基本上就是服務作者提出來的演算法:

Points are calculated using Market Rank. They are a measure of the popularity of a post across online communities. Blog points are simply the sum of a blog’s post points.

不是太看好但就觀察看看...