
Googlebot's Math.random()

Via Hacker News Daily I came across the interesting finding "Googlebot's Javascript random() function is deterministic". The author discovered that Googlebot's Math.random() is not random at all; in fact, it is fixed:

The first time Googlebot calls Math.random() the result will always be 0.14881141134537756, the second call will always be 0.19426893815398216. The script I linked to above simply uses this fact but obfuscates it a little and ‘seed’ it with something that doesn’t look too arbitrary.

Anything that needs unpredictable random numbers (i.e., anything with security requirements) should use functions like RandomSource.getRandomValues() rather than Math.random(), so this is not actually a big deal...
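As a minimal sketch of the distinction (the token helper below is a hypothetical example of mine, not something from the original post):

```typescript
// Math.random() is fine for jitter, sampling, or A/B bucketing, but it
// is not a CSPRNG -- as the Googlebot case shows, it may even be seeded
// deterministically. Security-sensitive values should come from the
// Web Crypto API: crypto.getRandomValues() in browsers; here the
// equivalent webcrypto object from node:crypto is used so the sketch
// runs standalone.
import { webcrypto } from "node:crypto";

// Hypothetical helper: a hex session token that must be unpredictable.
function randomToken(bytes: number): string {
  const buf = new Uint8Array(bytes);
  webcrypto.getRandomValues(buf); // CSPRNG-backed, unlike Math.random()
  return Array.from(buf, (b) => b.toString(16).padStart(2, "0")).join("");
}

console.log(randomToken(16)); // 32 hex characters
```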

Twitter Launches a Full-archive Search API

In the earlier post "Twitter to Launch a Premium API" I mentioned that Twitter planned to introduce a Premium API tier between the Standard and Enterprise tiers, rounding out the product line so that startups would have a mid-level service to use.

Then yesterday Twitter launched the Full-archive search API: "Introducing the premium full-archive search endpoint". The rate limits make it clear that it will not be enough for Enterprise users, but startups should find it workable.

Twitter usage in Taiwan is fairly low, so for Taiwan-focused applications this may not matter much, but for teams elsewhere it should open up quite a few new possibilities...

DuckDuckGo Releases a Browser and an Extension

DuckDuckGo has released its own browser and browser extension, with privacy as the selling point: "DuckDuckGo moves beyond search to also protect you while browsing." Several platforms are supported (though IE and Edge are not on the list XDDD):

Our updated app and extension are now available across all major platforms – Firefox, Safari, Chrome, iOS, and Android – so that you can easily get all the privacy essentials you need on any device with just one download.


I gave the iOS app a quick try; it is reasonably usable, so I will probably keep testing it for a while...

Elsevier Keeps Serving German Research Institutions Despite the Lapsed Contract

By the end of 2017, when their contract with Elsevier expired, Germany's research institutions still had not renewed, but Elsevier decided to keep providing access anyway, provisionally for a year, while negotiations continue:

The Dutch publishing giant Elsevier has granted uninterrupted access to its paywalled journals for researchers at around 200 German universities and research institutes that had refused to renew their individual subscriptions at the end of 2017.

The institutions had formed a consortium to negotiate a nationwide licence with the publisher. They sought a collective deal that would give most scientists in Germany full online access to about 2,500 journals at about half the price that individual libraries have paid in the past. But talks broke down and, by the end of 2017, no deal had been agreed. Elsevier now says that it will allow the country’s scientists to access its paywalled journals without a contract until a national agreement is hammered out.

Elsevier is doing this mainly to keep German academic institutions from discovering that life without Elsevier is actually just fine. Plenty of researchers already know this: in most cases there are alternatives to Elsevier, so there is no need to waste money on such expensive contracts:

Günter Ziegler, a mathematician at the Free University of Berlin and a member of the consortium's negotiating team, says that German researchers have the upper hand in the negotiations. “Most papers are now freely available somewhere on the Internet, or else you might choose to work with preprint versions,” he says. “Clearly our negotiating position is strong. It is not clear that we want or need a paid extension of the old contracts.”

The alternatives take several forms. The freely accessible arXiv keeps gaining importance: many researchers post a pre-print of each submission there (and even keep it updated), and in recent years some famous proofs have been published only there (such as the proof of the Poincaré conjecture). Hosting a paper on arXiv is also easier than hosting it on your own site (no maintenance required), all of which has made arXiv the new standard platform in academia.

Beyond arXiv, other fields have their own established platforms; in cryptography, for example, the "Cryptology ePrint Archive" has been running for years.

Even papers hosted on authors' own sites (usually personal space at a university or research institution) have become much easier to find and download, thanks to modern search engines.

The bluntest instrument is Sci-Hub, which pulls papers from behind paywalls and republishes them for anyone to search. It changes domains constantly because it keeps getting blocked, but accessing its hidden service via Tor Browser (or your own Tor proxy) sidesteps that problem.

Here's hoping Germany holds out and proves that Elsevier is no longer needed...

Details of Googlebot's Web Rendering Service

Only via "Polymer 2 and Googlebot" did I learn that back in August Google had publicly documented the Web rendering service (WRS) that Googlebot uses: "Rendering on Google Search". As you might guess, it is based on a modified Google Chrome:

Googlebot uses a web rendering service (WRS) that is based on Chrome 41 (M41). Generally, WRS supports the same web platform features and capabilities that the Chrome version it uses — for a full list refer to chromestatus.com, or use the compare function on caniuse.com.

There are some noteworthy details, such as the lack of WebSocket support; pages that care about their Google search results need to handle those failures gracefully...
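A hedged sketch of what that error handling can look like, assuming a page that would normally stream updates over WebSocket (the helper below is my own illustration, not from Google's documentation):

```typescript
// WRS does not support WebSocket, so a page that assumes a working
// socket can end up rendering nothing for Googlebot. Feature-detect
// the API and fall back to static/polled content; real code should
// additionally attach onerror/onclose handlers that switch to polling,
// since a client may expose the constructor yet refuse connections.
type Transport = "websocket" | "polling";

function pickTransport(scope: Record<string, unknown>): Transport {
  return typeof scope["WebSocket"] === "function" ? "websocket" : "polling";
}

console.log(pickTransport({})); // polling
```

The important part is that the WebSocket-less path still produces renderable content instead of throwing, so the page gets indexed either way.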

Yahoo! and Mozilla Go to War Over the Default Search Engine...

After Mozilla ended its earlier deal with Yahoo! (under which Yahoo! was the default search engine in Firefox), Yahoo! sued, and Mozilla recently disclosed its counter-suit: "Mozilla Files Cross-Complaint Against Yahoo Holdings and Oath".

Yahoo!'s complaint (PDF) is "2017-12-01-Yahoo-Redacted-Complaint.pdf"; Mozilla's cross-complaint (PDF) is "2017-12-05-Mozilla-Redacted-X-Complaint-with-Exhibits-and-POS.pdf".

When Firefox 57 shipped, Mozilla switched the default search engine back to Google (see "Mozilla terminates its deal with Yahoo and makes Google the default in Firefox again"), but the much bigger Firefox 57 story at the time was Quantum, which brought the browser's speed up to a level competitive with the reigning champion, Google Chrome, so few people noticed the change...

Only a few weeks later, once the excitement had died down and news of the suit and counter-suit came out, did people notice that the search engine had been swapped... XD


Google's .search Domain

Netcraft's "November 2017 Web Server Survey" covers the .search gTLD that Google has obtained:

This month the controversial new .search gTLD being run by Google’s Charleston Road Registry subsidiary was found for the first time, with www.nic.search responding to the survey. Google hopes it will be able to run .search as a dotless domain which will automatically redirect users to their search engine of choice. This proposal has been criticised for going against ICANN’s own rules, which prohibits this functionality due to the potential for conflicts with existing names on internal networks. This feature could also cause confusion for users who have come to expect that typing words into their address bar will perform a search query for that term.

It is currently uncertain whether or not Google will be allowed to run the .search TLD as a dotless domain, however with the launch of the first site on this TLD this month Google is one step closer to the provision of this service.

Digging around a bit, ICANNWiki's ".search" page has some background, and there is also news coverage (from 2013): "Google Wants To Operate .Search As A "Dotless" Domain, Plans To Open .Cloud, .Blog And .App To Others".



The original article covers the reconnaissance phase of a penetration test, enumerating the hostnames under a given domain: "A penetration tester's guide to sub-domain enumeration".

The most direct method is still a DNS zone transfer (AXFR): if the administrator has misconfigured the DNS server, this is the fastest way in. When that door is closed, you fall back on the various other scanning techniques.
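To make the AXFR step concrete, here is a rough sketch of mine (not code from the article) of what the probe looks like on the wire, assuming you already know the zone's authoritative name server:

```typescript
// A zone transfer is just an ordinary DNS query with QTYPE 252 (AXFR),
// sent over TCP. A properly locked-down server answers REFUSED; a
// misconfigured one streams back every record in the zone -- i.e. the
// full hostname list the attacker is after.
function buildAxfrQuery(zone: string): Buffer {
  const header = Buffer.from([
    0x12, 0x34, // ID (arbitrary)
    0x00, 0x00, // flags: standard query
    0x00, 0x01, // QDCOUNT = 1
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ANCOUNT/NSCOUNT/ARCOUNT = 0
  ]);
  // QNAME: each label prefixed by its length, terminated by 0x00
  const labels = zone
    .split(".")
    .filter(Boolean)
    .map((l) => Buffer.concat([Buffer.from([l.length]), Buffer.from(l, "ascii")]));
  const question = Buffer.concat([
    ...labels,
    Buffer.from([0x00]),       // root label ends the name
    Buffer.from([0x00, 0xfc]), // QTYPE 252 = AXFR
    Buffer.from([0x00, 0x01]), // QCLASS 1 = IN
  ]);
  const body = Buffer.concat([header, question]);
  const framed = Buffer.alloc(2 + body.length);
  framed.writeUInt16BE(body.length, 0); // TCP DNS: 2-byte length prefix
  body.copy(framed, 2);
  return framed;
}
```

In practice you would write this buffer to port 53 of each authoritative server over TCP (`node:net`'s `Socket`) and check whether the reply is REFUSED or an actual record stream; `dig axfr` does all of this for you, and digi.ninja's zonetransfer.me is the classic deliberately-open zone for safe testing.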


Someone could probably collect everything mentioned there and turn it into a single program XD

A Full-text Search Engine for MySQL: Mroonga

More or less by accident I stumbled across "Mroonga", a full-text search engine for MySQL.

The project seems to be driven mainly by the Japanese community? It has been in development since 2010, claims to support CJK, and pre-built packages exist for the major operating systems (e.g. a PPA for Ubuntu).

The loudest community these days is still Elasticsearch's, but Mroonga looks interesting; for a small project you just want to play with, it might be a fun option?
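For a sense of what playing with it looks like, a minimal SQL sketch (table and column names are hypothetical; the engine name and `MATCH ... AGAINST` usage follow Mroonga's documentation):

```sql
-- Mroonga plugs in as a storage engine; full-text search over CJK text
-- then works through the standard MATCH ... AGAINST syntax, without
-- the ngram workarounds InnoDB full-text needs.
CREATE TABLE posts (
  id INT PRIMARY KEY AUTO_INCREMENT,
  title TEXT,
  body TEXT,
  FULLTEXT INDEX (body)
) ENGINE = Mroonga DEFAULT CHARSET = utf8mb4;

SELECT id, title
  FROM posts
 WHERE MATCH (body) AGAINST ('全文檢索' IN BOOLEAN MODE);
```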

Attack Techniques Against Open Data

A while back I came across "Membership Inference Attacks against Machine Learning Models". The attack it attempts:

[G]iven a data record and black-box access to a model, determine if the record was in the model's training dataset.

In other words, given access to a model trained on some open dataset, develop a method to determine whether a particular record is part of it. Validating the attack is then a matter of attacking real services and measuring the results:

We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.

The attack uses neural networks against neural networks, and there is no good defense at the moment; anything you do costs some accuracy. The paper lists four ways to make the attack harder:

  • Restrict the prediction vector to top k classes.
  • Coarsen precision of the prediction vector.
  • Increase entropy of the prediction vector.
  • Use regularization.
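The first two mitigations are easy to sketch (my own illustration, not code from the paper): truncate the prediction vector to its top-k entries and round what remains, so the API reveals less of the confidence fingerprint the attack relies on:

```typescript
// Keep only the k largest class probabilities and coarsen them to a
// fixed number of decimal places; everything else is zeroed out.
// The other two mitigations -- increasing the entropy of the output
// and regularization -- happen at training/serving time and are not
// shown here.
function hardenPrediction(probs: number[], k: number, decimals: number): number[] {
  const topK = new Set(
    probs
      .map((p, i) => [p, i] as const)
      .sort((a, b) => b[0] - a[0])
      .slice(0, k)
      .map(([, i]) => i),
  );
  const factor = 10 ** decimals;
  return probs.map((p, i) => (topK.has(i) ? Math.round(p * factor) / factor : 0));
}

console.log(hardenPrediction([0.612, 0.273, 0.081, 0.034], 2, 1));
// logs [ 0.6, 0.3, 0, 0 ]
```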

Another item worth reading is the 2006 "AOL search data leak": after that data was released, real users were identified from it, which caused quite a stir...