Brave announces that Brave Search now runs entirely on its own data, no longer using Bing's

Brave has announced that Brave Search is now independent and no longer uses Bing's data: "Brave Search removes last remnant of Bing from search results page, achieving 100% independence and providing real alternative to Big Tech search".

According to Brave, about 7% of results previously came from the Bing API:

Every Web search result seen in Brave Search is now served by our own index. We’ve removed all search API calls to Bing, which previously represented about 7% of query results.

As long as the business model is still advertising I probably won't use it... but compared to the pile of search engines on the market that just wrap someone else's engine, this is genuinely good news.

Testing PGroonga, a full-text search engine for PostgreSQL

I saw "PGroonga 3.0.0 - Multilingual fast full text search" on the PostgreSQL news page, realized I had never tried PGroonga, and grabbed a machine to test it.

PGroonga is an extension that brings full-text search to PostgreSQL using Groonga as the engine, and it supports CJK languages.

Start by looking at the supported column types and the corresponding syntax in "Reference manual | PGroonga": the basics (text, text[], varchar, varchar[]) are all supported, and more unusually there is jsonb, which appears to search the text fields inside the document.

Another notable feature is that it hooks into LIKE '%something%' queries, which also helps with existing applications you can't modify.

On the downside, the project notes that the resulting index is larger than other extensions'; but given that our environment needs CJK support, there aren't many players left on the field anyway.

Another drawback is that neither RDS on AWS nor Cloud SQL on GCP appears to support it at the moment, so to use it you have to run and manage it yourself; maybe the old trick works here, hanging a replica off the managed instance via replication?

On to installing and testing. I tested on Ubuntu 22.04 on x86-64; following the "How to install for system PostgreSQL" section of "Install on Ubuntu | PGroonga" is enough, installing the system PostgreSQL 14 together with postgresql-14-pgroonga. If you want the newer official PostgreSQL packages instead, see the "How to install for the official PostgreSQL" section.

Then head to the "Tutorial | PGroonga" page and create an index on the column you want to search (the table here is memos and the column is content):

CREATE INDEX ON memos USING pgroonga (content);

The official tutorial uses SET enable_seqscan = off; to disable sequential scans, and EXPLAIN then shows the index being used:

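test=# SET enable_seqscan = off;
SET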
test=# SELECT * FROM memos WHERE content &@ 'engine';
 id |                                content                                 
----+------------------------------------------------------------------------
  2 | Groonga is a fast full text search engine that supports all languages.
(1 row)

test=# EXPLAIN SELECT * FROM memos WHERE content &@ 'engine';
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Index Scan using memos_content_idx on memos  (cost=0.00..43.18 rows=1 width=36)
   Index Cond: (content &@ 'engine'::text)
(2 rows)

First, drop the index again:

test=# DROP INDEX memos_content_idx;
DROP INDEX

Next, data needs to be loaded; I used CQD's Chinese lorem ipsum generator (中文假文產生器), which is handy because it has an API. With psql's \timing turned on:

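test=# \timing
Timing is on.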
test=# SELECT COUNT(*) FROM memos;
 count  
--------
 100000
(1 row)

Time: 15.495 ms

Then I ran the test a few more times, searching directly with LIKE '%台北%'; it pretty consistently took over 150ms:

test=# SELECT COUNT(*) FROM memos WHERE content LIKE '%台北%';
 count 
-------
   710
(1 row)

Time: 178.784 ms

Now build the index:

test=# CREATE INDEX ON memos USING pgroonga (content);
CREATE INDEX
Time: 17638.124 ms (00:17.638)

Running the same query a few more times shows a huge improvement:

test=# SELECT COUNT(*) FROM memos WHERE content LIKE '%台北%';
 count 
-------
   710
(1 row)

Time: 9.876 ms

Kagi raises prices again

Kagi already raised prices once, in September 2022 ("Kagi announces its progress three months after starting to charge"): back then the plan with unlimited searches went from $10/mo to $19/mo and was renamed Kagi Unlimited.

This time they've announced that on 3/15 the unlimited-search plan goes up to $25/mo:

We are also introducing the Ultimate plan with unlimited searches for $25/month.

And existing users (which should include both the earliest $10/mo plan and the later $19/mo one) will be moved to $25/mo once their current subscriptions run out:

Every existing subscriber will have their current subscription honored until expiration. That means if you are a subscriber at the time of the new plan rollout, you will still get unlimited searches as a part of your original plan until your existing subscription expires or is set to renew.

In the Hacker News discussion, "Update to Kagi Search Pricing (kagi.com)", you can see a lot of people noticing the issue of Kagi wanting to pour money into AI:

This is so incredibly disappointing and basically confirms that we're (at least in part) paying for their AI experiments - something that I personally am not at all interested in.

So this price hike is going to fund AI? And it looks like Kagi doesn't control its own machine learning technology but uses someone else's API (presumably OpenAI's). A lot of people are definitely not going to buy that...

I'll take another look when my renewal comes up at the end of the month...

GitHub's home-grown search engine

A while back GitHub published an article about the journey of building its own search engine: "The technology behind GitHub's new code search".

Reading through it, they essentially built their own search engine cluster and then added code-search-specific features on top.

This search engine is still in beta; of the 200 million repositories on the site it only covers 45 million (roughly 22%), and that's already 115 TB of code. They also mention that when they first deployed Elasticsearch the number was 8 million repositories:

GitHub’s scale is truly a unique challenge. When we first deployed Elasticsearch, it took months to index all of the code on GitHub (about 8 million repositories at the time). Today, that number is north of 200 million, and that code isn’t static: it’s constantly changing and that’s quite challenging for search engines to handle. For the beta, you can currently search almost 45 million repositories, representing 115 TB of code and 15.5 billion documents.

It currently runs on 32 machines; they don't specifically mention memory sizes, nor numbers like replication:

Code search runs on 64 core, 32 machine clusters.

And the various inverted indices plus all the data come to just 25 TB after compression:

There are some big wins on the size of the index as well. Remember that we started with 115 TB of content that we want to search. Content deduplication and delta indexing brings that down to around 28 TB of unique content. And the index itself clocks in at just 25 TB, which includes not only all the indices (including the ngrams), but also a compressed copy of all unique content. This means our total index size including the content is roughly a quarter the size of the original data!
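Spelling out the arithmetic in that quote: deduplication and delta indexing take 115 TB down to about 28 TB of unique content, and the final index, ngrams plus a compressed copy of that content included, is 25 TB, i.e. 25/115 ≈ 22% of the original.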

Run those numbers and you realize we're now in an era where "brute force" solves a lot of problems, and this is already the largest code hosting service in the world.

It used to be that scaling up almost any topic would run into Amdahl's law; things are a lot easier now...

Kagi's Universal Summarizer

A new service seen at "Universal Summarizer (kagi.com)" that produces a summary for a given URL. The service itself is at "Kagi - Universal Summarizer", and judging from the URL it looks like one of Kagi's experiments.

For example, with CNN's "Biden's dramatic warning to China", the summary it pulls out looks fine:

President Joe Biden delivered a dramatic warning to China in his State of the Union address, vowing to protect America's sovereignty if China threatens it. He specifically named President Xi Jinping, saying "Name me a world leader who'd change places with Xi Jinping. Name me one!" This marked a stark escalation in the US-Chinese relationship, which has been strained by a balloon surveillance program and other issues. Biden also addressed Russia, calling their invasion of Ukraine a test for America and the world. His speech highlighted the unified opposition to China in US politics, with House Speaker Kevin McCarthy having convened a bipartisan House committee to examine the perceived threat from the Chinese Communist Party. Biden's comments also served as an important milestone in the increasingly tumultuous competition between the US and China, as the US shifts to talking about establishing guardrails for the relationship and protecting the Western-led rules-based international system.

A feature like this could be built on GPT-3.5 or ChatGPT, but it's not clear whether Kagi is calling out to such an API or doing it in-house; I'd guess there's a good chance it's an API...

And since the page takes its target via the URL, it can be wrapped as a bookmarklet and kept on the bookmarks bar: "Kagi Universal Summarizer".
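The bookmarklet body is presumably something along these lines (my guess only; the exact endpoint path and query parameter name should be taken from the linked bookmarklet, the ones here are placeholders):

javascript:location.href='https://kagi.com/summarizer/?url='+encodeURIComponent(location.href);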

There's also a very neat use seen in the HN discussion: throw in the URL of the JavaScript code at https://bellard.org/quickjs/pi_bigdecimal.js and it actually produces an explanation of the code, even correctly identifying the Chudnovsky algorithm:

This code uses the QuickJS bigdecimal type to calculate the value of pi to a given precision. It does this by using the Chudnovsky algorithm, which is a series of calculations that can be used to approximate pi. The code is written in Javascript and uses BigInt and BigDecimal to perform the calculations. It is interesting to note that the code also takes into account the possibility of bad rounding for the last digits, and adds extra digits to reduce the probability of this happening.

Now that it's wrapped in a bookmarklet, I'll probably be pulling it out a lot...

Yahoo! getting back into search engines?

Saw the news on Hacker News that Y! wants to get back into search: "Yahoo is making a return to search", with the HN discussion at "Yahoo is making a return to search (searchengineland.com)".

The signs include job postings, the reactivation of the @YahooSearch Twitter account, and some Y! executives on LinkedIn openly mentioning this round of hiring.

The current Yahoo! Search should still be Bing data (there hasn't been news in a long while); at least according to Wikipedia, it switched back to Bing after October 2019:

As of October 2019, Yahoo! Search is once again powered by Bing.

Maybe this will bring a bit more movement to the market?

The FBI recommends ad blockers to reduce browsing risk

News seen at "Even the FBI says you should use an ad blocker"; the FBI's announcement itself is at "Cyber Criminals Impersonating Brands Using Search Engine Advertisement Services to Defraud Users".

The trigger is that a lot of cybercrime works by buying ads for exposure on search engines, luring users into clicking:

Cyber criminals purchase advertisements that appear within internet search results using a domain that is similar to an actual business or service. When a user searches for that business or service, these advertisements appear at the very top of search results with minimum distinction between an advertisement and an actual search result. These advertisements link to a webpage that looks identical to the impersonated business’s official webpage.

One variant: a user types in keywords looking to download some specific piece of software, and the criminals buy ads to steer them to a fake site that serves a build with a backdoor trojan:

In instances where a user is searching for a program to download, the fraudulent webpage has a link to download software that is actually malware. The download page looks legitimate and the download itself is named after the program the user intended to download.

This approach reminds me of the earlier North Korean government attack on PuTTY: "Trojanized versions of PuTTY utility being used to spread backdoor".

The FBI's suggested protections for individuals include an ad blocking extension, which cuts down the avenues of attack:

Use an ad blocking extension when performing internet searches. Most internet browsers allow a user to add extensions, including extensions that block advertisements. These ad blockers can be turned on and off within a browser to permit advertisements on certain websites while blocking advertisements on others.

As for which blocker, just use uBlock Origin; it's supported on both Chromium-family browsers (including Google Chrome) and Firefox.

Kagi can now share search results

Kagi previously restricted search to registered users, but this update makes results shareable (as seen in the Changelog):

Enable sharing of search results pages with people who lack a Kagi account #23 @ivan

The link is to the feature request "Enable sharing of search results pages with people who lack a Kagi account"; it also looks like a way to pitch Kagi to other people, which is presumably why it got built.

I used incognito mode to simulate not being logged in, tried the keyword "Kalafina", and confirmed that the results page is now visible.

Kagi's URL rewrite feature

Last week Kagi rolled out a URL rewrite feature that can swap out the addresses in search results (as seen in the "Changelog"):

Rewrite rules for domains - ability to e.g. translate "reddit.com" into "old.reddit.com" #158 @TeMPOraL

This could also be handled in the browser with an extension or a userscript (and synced across machines via the browser's cloud sync), but nothing like that seems to exist yet, so you'd have to write one yourself.
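As a rough sketch of what such a userscript could look like (my own sketch, not an existing project), here's a Tampermonkey-style script doing the mobile-Wikipedia redirect discussed below:

// ==UserScript==
// @name     m.wikipedia.org to wikipedia.org redirect (sketch)
// @match    https://*.m.wikipedia.org/*
// @run-at   document-start
// ==/UserScript==
(() => {
  'use strict';
  // Swap the mobile hostname for the desktop one, keeping the rest of the URL.
  const target = location.href.replace(
    /^https:\/\/([a-z]+)\.m\.wikipedia\.org\//,
    'https://$1.wikipedia.org/'
  );
  if (target !== location.href) location.replace(target);
})();

Note this only swaps the hostname; my actual Kagi rule below also rewrites the path.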

The example given, turning reddit.com into old.reddit.com, is a pretty common community use case (nobody likes the new interface); I myself turn *.m.wikipedia.org into *.wikipedia.org, with some extra work that the next rule covers:

^https://([a-z]+)\.m\.wikipedia\.org/[-a-z]*/(.*)|https://$1.wikipedia.org/wiki/$2

It's easy to see this takes regular expressions, though the docs don't seem to say which flavor is supported (POSIX-family or PCRE-like; the breakdown on the Wikibooks "Regular Expressions" page is a rough reference).
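Whatever the flavor, the rule above behaves like an ordinary pattern|replacement pair; here's the same rule expressed with JavaScript's replace(), just to illustrate the semantics (not how Kagi implements it):

// The rule above, as a JavaScript regex replace.
const mobile = 'https://en.m.wikipedia.org/wiki/Kalafina';
const desktop = mobile.replace(
  /^https:\/\/([a-z]+)\.m\.wikipedia\.org\/[-a-z]*\/(.*)/,
  'https://$1.wikipedia.org/wiki/$2'
);
console.log(desktop); // https://en.wikipedia.org/wiki/Kalafina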

My other rule turns Wikipedia URLs with a language-variant code, https://zh.wikipedia.org/zh-*/, into https://zh.wikipedia.org/wiki/:

^https://zh\.wikipedia\.org/zh-([a-z]*)/(.*)|https://zh.wikipedia.org/wiki/$2

The downside of doing this is that two results can show up; once the first one has been rewritten it gets a small icon, and hovering over it shows which rule did the rewrite.

A fuzzy search library for JavaScript

Seen on Hacker News Daily, a Show HN (a post where the author or a main contributor presents their own work) offering a fuzzy search library that claims to be fast and light on resources: "Show HN: uFuzzy.js – A tiny, efficient fuzzy search that doesn't suck (github.com/leeoniya)".

This area has been developing for ages, so when someone suddenly shows up saying their thing is super great, unless there's been a new fundamental algorithmic breakthrough, the classic "Three circles model" immediately comes to mind (I was too lazy to fill in the middle regions).

According to his "benchmarks", he claims to be ahead across the board.

But then look back at the comments:

Thank you for this!

I am also quite frustrated with the current state of full text search in the javascript world. All libs I've tried miss the most basic examples and their community seems to ignore it. Will give yours a try but it already looks much better from the comparison page.

Edit: Nope, your lib doesn't seem to handle substitution well (THE most common type of typo), so yep, we are back to square one ...

From fuzzy search I expected that entering "super meet boy" or "super maet boy" will return "Super Meat Boy" but unfortunately currently it doesn't work this way and it's quite disappointing.

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...

It looks like this library can't solve the most common fuzzy search case (small typos); going by how the examples describe it, it's more like substring search plus some extra features, making it closer to an auto-completion library, or, put more broadly, an auto-suggestion library.
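For reference, basic usage looks roughly like this (a sketch based on the project's README; the npm package name and the sample data are my assumptions, so double-check against the repo):

// Sketch of uFuzzy usage based on its README; verify against the repo.
import uFuzzy from '@leeoniya/ufuzzy';

const haystack = ['Super Meat Boy', 'Super Mario Bros', 'Meat Loaf'];
const uf = new uFuzzy();

// filter() returns the indices of haystack entries that match the needle.
// Multi-term substring matching like this works:
console.log(uf.filter(haystack, 'super meat')?.map((i) => haystack[i]));
// ...but per the comment quoted above, a substitution typo such as
// 'super maet boy' reportedly comes back empty.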

But I think the real point (for me, anyway) is the comparison table further down, since it lists the options currently on the market; that list will be worth coming back to...