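GitHub's in-house search engine

A while back GitHub published a post about the journey of building its own search engine: "The technology behind GitHub’s new code search".

From what I read, they basically built their own search engine cluster from scratch and then added features specific to code search.

The new search engine is still in beta and covers only 45 million of the 200 million repositories on the site (about 22.5%), which already amounts to 115 TB of code. The post also mentions that back when Elasticsearch was first deployed, the number was 8 million repositories: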

GitHub’s scale is truly a unique challenge. When we first deployed Elasticsearch, it took months to index all of the code on GitHub (about 8 million repositories at the time). Today, that number is north of 200 million, and that code isn’t static: it’s constantly changing and that’s quite challenging for search engines to handle. For the beta, you can currently search almost 45 million repositories, representing 115 TB of code and 15.5 billion documents.

It currently runs on 32 machines. The post doesn't say how much memory they have, and gives no figures for things like replication:

Code search runs on 64 core, 32 machine clusters.

And the various inverted indices plus the rest of the data come to only 25 TB after compression:

There are some big wins on the size of the index as well. Remember that we started with 115 TB of content that we want to search. Content deduplication and delta indexing brings that down to around 28 TB of unique content. And the index itself clocks in at just 25 TB, which includes not only all the indices (including the ngrams), but also a compressed copy of all unique content. This means our total index size including the content is roughly a quarter the size of the original data!

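Do the math and you'll see that we're now in an era where brute force can solve a lot of problems, and this is already the largest code hosting site in the world.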
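As a rough back-of-the-envelope version of that math, the sketch below only uses the figures quoted from the GitHub post. The per-machine numbers rest on my own assumption that a single 32-machine cluster holds one copy of the index, spread evenly; the post doesn't say this.

```python
# Figures quoted from the GitHub post.
total_repos = 200_000_000
indexed_repos = 45_000_000
raw_tb = 115          # raw content covered by the beta
unique_tb = 28        # after deduplication and delta indexing
index_tb = 25         # all indices plus a compressed copy of the unique content
documents = 15_500_000_000
machines = 32

coverage = indexed_repos / total_repos      # share of repositories in the beta
index_ratio = index_tb / raw_tb             # index size relative to the raw content

# Assumption (not from the post): one copy of the index, spread evenly
# over a single 32-machine cluster, with 1 TB counted as 1000 GB.
index_gb_per_machine = index_tb * 1000 / machines
docs_per_machine = documents / machines

print(f"coverage: {coverage:.1%}")
print(f"index / raw content: {index_ratio:.1%}")
print(f"index per machine: {index_gb_per_machine:.0f} GB")
print(f"documents per machine: {docs_per_machine:,.0f}")
```

Under that assumption each machine would hold well under 1 TB of index, which is the kind of size a single ordinary server can handle.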

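It used to be that almost any topic would run into Amdahl's law once it got a bit bigger; things are a lot easier now...

Is Yahoo! getting back into search engines?

I saw on Hacker News that Y! is getting back into the search engine business: "Yahoo is making a return to search". The HN discussion is at "Yahoo is making a return to search (searchengineland.com)".

The signs include job postings, the revival of the Twitter account @YahooSearch, and some Y! executives publicly mentioning the hiring on LinkedIn.

Yahoo! Search today should be running on Bing's data (I haven't heard any news about it in a long time); at least, the description on Wikipedia says it switched back to Bing in October 2019: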

As of October 2019, Yahoo! Search is once again powered by Bing.

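Maybe this will bring a bit more variety to the market?

FBI recommends ad blockers to reduce risk while browsing

I saw this news in "Even the FBI says you should use an ad blocker"; the FBI's announcement itself is at "Cyber Criminals Impersonating Brands Using Search Engine Advertisement Services to Defraud Users".

The background is that a lot of cybercrime works by buying ads that show up on search engines and lure users into clicking: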

Cyber criminals purchase advertisements that appear within internet search results using a domain that is similar to an actual business or service. When a user searches for that business or service, these advertisements appear at the very top of search results with minimum distinction between an advertisement and an actual search result. These advertisements link to a webpage that looks identical to the impersonated business’s official webpage.

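One approach: a user types in keywords to download a specific piece of software, and the criminals use ads to lure the user to a fake site that serves a copy of the software with a backdoor or trojan in it: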

In instances where a user is searching for a program to download, the fraudulent webpage has a link to download software that is actually malware. The download page looks legitimate and the download itself is named after the program the user intended to download.

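This reminds me of the earlier attack on PuTTY by the North Korean government: "Trojanized versions of PuTTY utility being used to spread backdoor".

The protections the FBI recommends for individuals include an ad blocking extension, which cuts down on the channels you can be attacked through: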

Use an ad blocking extension when performing internet searches. Most internet browsers allow a user to add extensions, including extensions that block advertisements. These ad blockers can be turned on and off within a browser to permit advertisements on certain websites while blocking advertisements on others.

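As for which ad blocker to use, my recommendation is uBlock Origin, which is supported on Chromium-based browsers (including Google Chrome) as well as Firefox.

Kagi can now share search results

Previously Kagi only let registered users search, but with this update results can be shared (see the Changelog):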

Enable sharing of search results pages with people who lack a Kagi account #23 @ivan

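The link here goes to the "Enable sharing of search results pages with people who lack a Kagi account" feature request. It also looks like a way to pitch Kagi to other people, so it got built.

I used incognito mode to simulate not being logged in, tested the keyword "Kalafina", and confirmed that the results page is visible.

Kagi's url rewrite

Last week Kagi launched a url rewrite feature that can replace URLs in the search results (see the "Changelog"):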

Rewrite rules for domains - ability to e.g. translate "reddit.com" into "old.reddit.com" #158 @TeMPOraL

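This can also be handled in the browser with an extension or a userscript (using the browser's cloud sync to carry it across machines), but I don't think anything like that exists yet, so you'd have to write one yourself.

The example of replacing reddit.com with old.reddit.com is a fairly common use in the community (nobody likes the new interface), but what I did myself was turn *.m.wikipedia.org into *.wikipedia.org. This rule also does a bit of extra work, which the next rule below covers: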

^https://([a-z]+)\.m\.wikipedia\.org/[-a-z]*/(.*)|https://$1.wikipedia.org/wiki/$2

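It's easy to see that it takes regular expressions, but the official docs don't seem to say which flavor is supported (POSIX family or PCRE-like; the classification in "Regular Expressions" on Wikibooks is a rough reference).

My other rule turns Wikipedia URLs that carry a language variant code, https://zh.wikipedia.org/zh-*/, into https://zh.wikipedia.org/wiki/: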

^https://zh\.wikipedia\.org/zh-([a-z]*)/(.*)|https://zh.wikipedia.org/wiki/$2
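To see what the two rules above do to a URL, here is a minimal sketch that emulates them with Python's `re` module. This is not Kagi's implementation: as noted, the regex flavor Kagi uses isn't documented, so `apply_rule` is a hypothetical helper that assumes the rule format is `pattern|replacement` with `$1`-style group references.

```python
import re

# The two rules from this post, in Kagi's "pattern|replacement" form.
MOBILE_RULE = r"^https://([a-z]+)\.m\.wikipedia\.org/[-a-z]*/(.*)|https://$1.wikipedia.org/wiki/$2"
VARIANT_RULE = r"^https://zh\.wikipedia\.org/zh-([a-z]*)/(.*)|https://zh.wikipedia.org/wiki/$2"


def apply_rule(rule: str, url: str) -> str:
    """Hypothetical helper: apply one rewrite rule to a URL using Python's regex flavor."""
    # Assumption: the last "|" separates the pattern from the replacement.
    pattern, _, replacement = rule.rpartition("|")
    # Translate "$1" style references into Python's "\g<1>" form.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, url, count=1)


print(apply_rule(MOBILE_RULE, "https://zh.m.wikipedia.org/zh-tw/Kalafina"))
print(apply_rule(VARIANT_RULE, "https://zh.wikipedia.org/zh-tw/Kalafina"))
print(apply_rule(MOBILE_RULE, "https://example.com/untouched"))
```

The first rule's `[-a-z]*` segment matches both `wiki` and variant paths such as `zh-tw`, so it normalizes the path as well as the hostname. A URL that matches neither pattern comes back unchanged.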

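The downside is that two entries show up in the results. After the first one is rewritten a small icon appears next to it, and hovering over the icon shows which rule did the rewriting.

A fuzzy search library for JavaScript

On Hacker News Daily I saw a Show HN (a post where the author or a main contributor presents their own work) for a fuzzy search library that claims to be very fast and very light on resources: "Show HN: uFuzzy.js – A tiny, efficient fuzzy search that doesn't suck (github.com/leeoniya)".

When a field has been developing for a long time and one day somebody suddenly says their thing is super great, then unless there has been a breakthrough in the underlying algorithms, I immediately think of the classic "Three circles model" (I didn't bother drawing in the overlapping regions in the middle).

According to his "tests", he claims to be ahead across the board.

But then look at the comments: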

Thank you for this!

I am also quite frustrated with the current state of full text search in the javascript world. All libs I've tried miss the most basic examples and their community seems to ignore it. Will give yours a try but it already looks much better from the comparison page.

Edit: Nope, your lib doesn't seem to handle substitution well (THE most common type of typo), so yep, we are back to square one ...

From fuzzy search I expected that entering "super meet boy" or "super maet boy" will return "Super Meat Boy" but unfortunately currently it doesn't work this way and it's quite disappointing.

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...

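It looks like this library can't handle the most common fuzzy search case (small typos). Going by the examples, it is closer to substring search plus some extra features, which makes it more like an auto completion library or, more broadly, an auto suggestion library.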
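To make the difference concrete, here is a small sketch of the behavior the commenter expected, using a plain Levenshtein edit distance. This is not how uFuzzy or any of the compared libraries works; the title list and the `max_distance` threshold are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


# Made-up sample data for illustration.
TITLES = ["Super Meat Boy", "Super Mario Bros", "Meat Boy Forever"]


def substring_search(query: str, items: list[str]) -> list[str]:
    """Case-insensitive substring match, roughly what an auto completion does."""
    q = query.lower()
    return [t for t in items if q in t.lower()]


def fuzzy_search(query: str, items: list[str], max_distance: int = 2) -> list[str]:
    """Keep items within max_distance edits of the query, closest first."""
    q = query.lower()
    scored = sorted((edit_distance(q, t.lower()), t) for t in items)
    return [t for d, t in scored if d <= max_distance]


print(substring_search("super meet boy", TITLES))  # the typo defeats substring matching
print(fuzzy_search("super meet boy", TITLES))      # one substitution away
print(fuzzy_search("super maet boy", TITLES))      # two substitutions away
```

Comparing the query against every full title like this is quadratic per pair and only sensible for short lists; it is meant to show the expectation, not a practical design.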

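But I think the real point (for me, anyway) is the comparison table further down, because it lists the options currently on the market; the list will be a handy reference later...

Kagi's progress three months after it started charging

Kagi published an update on its progress three months after it started charging (see my earlier post "Kagi has started charging"): "Kagi status update: First three months".

On the search side (the Kagi product line), there are currently 2,600 paying users, which at US$10/mo works out to about US$26K/mo in revenue: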

Kagi search is currently serving ~2,600 paid customers. We have seen steady growth since the launch 3 months ago. Note, this is with zero marketing and fully relying on word of mouth. We prefer to keep things this way for now, as we are still developing the product towards our vision of a user-centric web search experience.

The figure in the later section on finances is similar (almost all of it is Kagi subscription revenue):

Between Kagi and Orion, we are currently generating around $26,500 USD in monthly recurring revenue, which incidentally about exactly covers our current API and infrastructure costs.

That revenue roughly covers the current infrastructure costs, but salaries and other operating costs are on the order of US$100K/mo, so there's still a long way to go:

Between Kagi and Orion, we are currently generating around $26,500 USD in monthly recurring revenue, which incidentally about exactly covers our current API and infrastructure costs.

That means that salaries and all other operating costs (order of magnitude of $100K USD/month) remain a challenge and are still paid out of the founders’ pocket (Kagi remains completely bootstrapped).

And it will take roughly ten times the current number of paying users to break even (25K users):

We are planning to reach sustainability at around 25,000 users mark, by further improving the product, introducing new offerings and pricing changes. With the product metrics being as good as they are, we should be able to reach this as our visibility increases.
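The numbers above can be checked with a few lines of arithmetic. The sketch below only uses figures quoted from the Kagi post plus the US$10/mo price; the projected revenue at 25,000 users assumes the price stays at US$10/mo, which is my assumption and not something the post states (it actually mentions pricing changes).

```python
# Figures from the Kagi status update and the US$10/mo plan.
paid_users = 2_600
price_per_month = 10        # US$
target_users = 25_000       # the stated sustainability mark

search_revenue = paid_users * price_per_month   # monthly revenue from search
growth_needed = target_users / paid_users       # multiple of today's paid users

# Assumption (not from the post): the price stays at US$10/mo.
revenue_at_target = target_users * price_per_month

print(f"search revenue: US${search_revenue:,}/mo")
print(f"growth needed: {growth_needed:.1f}x")
print(f"revenue at target: US${revenue_at_target:,}/mo")
```

The growth multiple comes out a little under ten, which matches the "roughly ten times" above.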

The somewhat better news is that the churn rate is very low:

Product stickiness is also very high, with churn being lower than 3%.

Then it says each user makes about 27 queries a day on average (free tier included), some users regularly make more than 100, and the heaviest go past 400:

We are currently serving around 70,000 queries a day or around ~27 queries/day/user (this includes free users which are about 10% of total users). There is a lot of variance in use though, with some users regularly searching >100 times a day. Every time we see a search count go >400 times in day we are happy to be an important part of someone’s search experience.

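I looked at my own usage and it seems to be on the high side, but nowhere near the 100 a day he mentions.

Then it mentions plans for new offerings, including a Teams Plan & Family Plan, while the plans currently on sale will be grouped under Individual Plans.

More importantly, there are plans to raise prices on the Individual Plans. The new lineup is planned to have three tiers, the main change being the addition of a Kagi Starter tier: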

  • Kagi Unlimited - $19/mo or $180/year ($15/mo) or $288/biennial ($12/mo) - Original Kagi experience, with unlimited searches
  • Kagi Starter ($5/mo; 200 searches) - For casual users who make less than 200 searches per month
  • Free basic - 50 free searches that reset every month

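That's quite an increase, although it does say that users who were already paying before the increase will keep the original price: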

If such change to Individual plans is to occur, we plan to grandfather-in all early adopters (meaning all current and future paid customers, up until this change) allowing them to keep their existing subscription price as long as they don’t cancel it.
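For a sense of how big the increase is, here is a quick check of the planned Kagi Unlimited prices against the current US$10/mo. All the prices come from the text above; nothing else is assumed.

```python
current_monthly = 10        # US$, the current plan
unlimited_monthly = 19      # US$, planned month-to-month price
unlimited_yearly = 180      # US$, planned price per year
unlimited_biennial = 288    # US$, planned price per two years

yearly_per_month = unlimited_yearly / 12
biennial_per_month = unlimited_biennial / 24
monthly_increase = (unlimited_monthly - current_monthly) / current_monthly
biennial_increase = (biennial_per_month - current_monthly) / current_monthly

print(f"yearly plan: US${yearly_per_month:.0f}/mo")
print(f"biennial plan: US${biennial_per_month:.0f}/mo")
print(f"month-to-month increase: {monthly_increase:.0%}")
print(f"increase even on the biennial plan: {biennial_increase:.0%}")
```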

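I'll keep watching...

Pen testing tools: various search engines

Something I saw on Twitter:

It's an image; here's a rundown of the 24 sites in it:

Lots of .io domains...

Quite a few of these services are ones I use now and then. They're also handy as groundwork for a pen test, since you can search through all kinds of pre-scanned results...

Google says it will bring back forced exact-match search with double quotes...

I saw "We're improving search results when you use quotes (blog.google)" on Hacker News Daily, which is how I found out it had been removed in the first place? (Then again, Google Search hasn't been my main search engine for a long time...)

The original post is "How we're improving search results when you use quotes", which says: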

For example, if you did a search such as [“google search”], the snippet will show where that exact phrase appears:

[...]

In the past, we didn’t always do this because sometimes the quoted material appears in areas of a document that don’t lend themselves to creating helpful snippets.

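In "Google for the exact phrase (and no, quotation marks don't help)" you can see that by 2020 double quotes already weren't returning exact results.

I probably won't go back to Google Search, though. For one thing Kagi performs pretty well, and for another I'd rather not hand Google more information...

Microsoft's Outlook system automatically clicks links in emails

Something I came across on Hacker News Daily a few days ago: Microsoft's Outlook system (the cloud service) automatically clicks links in emails, which causes a pile of problems: "“Magic links” can end up in Bing search results — rendering them useless.". The Hacker News discussion also has plenty of victims complaining: "“Magic links” can end up in Bing search results, rendering them useless (medium.com/ryanbadger)".

The original title is more accusatory, claiming that Outlook feeds these links to Bing for indexing; I haven't seen solid evidence of that yet.

Back to the link-clicking problem: going by the material cited in the article, this seems to have been happening since 2017: "Do any common email clients pre-fetch links rather than images?".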

As of Feb 2017 Outlook (https://outlook.live.com/) scans emails arriving in your inbox and it sends all found URLs to Bing, to be indexed by Bing crawler.

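The Hacker News discussion also mentions that mechanisms like one-time login emails are affected, forcing sites to fall back on more laborious ways of logging users in (such as giving the user a one-time code to type in, rather than logging them in when they click a link).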
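As an illustration of the one-time code approach, here is a minimal in-memory sketch. Nothing in it comes from the article or the discussion: the `OneTimeCodes` class, the 6-digit format, and the 10-minute lifetime are all assumptions of mine, and a real system would also need persistent storage and a limit on attempts. The point is that a mail scanner fetching URLs can't burn the login, because nothing happens until the user types the code in.

```python
import hmac
import secrets
import time

CODE_TTL_SECONDS = 600  # assumed lifetime of a code: 10 minutes


class OneTimeCodes:
    """Hypothetical in-memory store of one-time login codes, keyed by email."""

    def __init__(self, ttl: float = CODE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._pending: dict[str, tuple[str, float]] = {}  # email -> (code, expiry)

    def issue(self, email: str) -> str:
        """Create a 6-digit code to send by email; replaces any earlier code."""
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[email] = (code, self._clock() + self._ttl)
        return code

    def verify(self, email: str, submitted: str) -> bool:
        """Accept a code at most once, and only before it expires."""
        entry = self._pending.get(email)
        if entry is None:
            return False
        code, expires_at = entry
        if self._clock() > expires_at:
            del self._pending[email]
            return False
        # Constant-time comparison; a real system should also count failed attempts.
        if not hmac.compare_digest(code.encode(), submitted.encode()):
            return False
        del self._pending[email]  # single use
        return True


codes = OneTimeCodes()
code = codes.issue("user@example.com")
print(codes.verify("user@example.com", code))  # first use succeeds
print(codes.verify("user@example.com", code))  # replay is rejected
```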

Noting this down for now; I'll probably run into it in future designs and will need to rethink the threat model...