
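Details of Googlebot's Web rendering service

I only found out from the article "Polymer 2 and Googlebot" that Google officially published details of the Web rendering service (WRS) used by Googlebot back in August this year: "Rendering on Google Search". As you might expect, it is based on a modified Google Chrome: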

Googlebot uses a web rendering service (WRS) that is based on Chrome 41 (M41). Generally, WRS supports the same web platform features and capabilities that the Chrome version it uses — for a full list refer to chromestatus.com, or use the compare function on caniuse.com.

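It mentions a few things worth noting, such as WebSocket not being supported, so pages that care about their Google search results need to pay attention to error handling...

Yahoo! and Mozilla are now fighting over the default search engine...

After Mozilla terminated its partnership with Yahoo! (under which Firefox used Yahoo! as its default search engine), Mozilla recently disclosed that Yahoo! had sued and that Mozilla had countersued: "Mozilla Files Cross-Complaint Against Yahoo Holdings and Oath".

Yahoo!'s complaint (PDF) is at "2017-12-01-Yahoo-Redacted-Complaint.pdf", and Mozilla's cross-complaint (PDF) is at "2017-12-05-Mozilla-Redacted-X-Complaint-with-Exhibits-and-POS.pdf".

When Firefox 57 was released, Mozilla switched the default search engine back to Google (see "Mozilla terminates its deal with Yahoo and makes Google the default in Firefox again"). The bigger news about Firefox 57 at the time, though, was the launch of Quantum, which brought the browser's speed up to a level that can compete with the current leader, Google Chrome, so not many people noticed the change...

It was only a few weeks later, after the news had cooled down and word of the suit and countersuit came out, that I noticed the search engine had been changed... XD

Cheering from the sidelines won't do much, so let's just pull up a bench and watch...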

Google's .search domain

In Netcraft's "November 2017 Web Server Survey" I saw this about the .search domain that Google obtained:

This month the controversial new .search gTLD being run by Google’s Charleston Road Registry subsidiary was found for the first time, with www.nic.search responding to the survey. Google hopes it will be able to run .search as a dotless domain which will automatically redirect users to their search engine of choice. This proposal has been criticised for going against ICANN’s own rules, which prohibits this functionality due to the potential for conflicts with existing names on internal networks. This feature could also cause confusion for users who have come to expect that typing words into their address bar will perform a search query for that term.

It is currently uncertain whether or not Google will be allowed to run the .search TLD as a dotless domain, however with the launch of the first site on this TLD this month Google is one step closer to the provision of this service.

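I looked around for more information: the ".search" entry on ICANNWiki has some details, and there is also a news article (from 2013): "Google Wants To Operate .Search As A “Dotless” Domain, Plans To Open .Cloud, .Blog And .App To Others".

This doesn't look good...

Ways to enumerate hostnames under a domain...

The original article covers the groundwork for penetration testing, where you need to enumerate the hostnames under a specific domain: "A penetration tester’s guide to sub-domain enumeration".

The most direct way is still a DNS zone transfer (AXFR); if the administrator has not configured the DNS server properly, this is the fastest approach. When that is not available, you have to fall back on various other ways of scanning.

Looking through it, there are several approaches:

Someone could probably take everything mentioned there and turn it into a program XD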
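As a rough idea of where such a program might start, here is a minimal sketch of one common fallback when AXFR is refused: guessing hostnames from a word list and keeping the ones that resolve. This technique is my own choice for illustration, since the list of approaches from the original article is not reproduced here. The function names, the `resolver` parameter and the word list are all hypothetical, and the sketch ignores wildcard DNS, which would make every guess look like a hit.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def resolve_with_socket(hostname: str) -> Optional[List[str]]:
    """Resolve a hostname with the system resolver; return None on failure."""
    try:
        # gethostbyname_ex returns (canonical name, aliases, IPv4 addresses).
        _, _, addresses = socket.gethostbyname_ex(hostname)
        return addresses
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return None


def enumerate_subdomains(
    domain: str,
    candidates: Iterable[str],
    resolver: Callable[[str], Optional[List[str]]] = resolve_with_socket,
) -> Dict[str, List[str]]:
    """Return a mapping of hostname -> addresses for candidates that resolve."""
    found: Dict[str, List[str]] = {}
    for label in candidates:
        hostname = f"{label}.{domain}"
        addresses = resolver(hostname)
        if addresses:
            found[hostname] = addresses
    return found
```

The resolver is passed in as a parameter so the loop can be tried against a fake lookup table before pointing it at a real domain, which should only be done with permission from the domain owner.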

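A full-text search engine on MySQL: Mroonga

I came across this more or less by accident: a full-text search engine on MySQL, "Mroonga".

It looks like it is mainly backed by the Japanese community? It has been in development since 2010, claims to support CJK, and has prebuilt packages for all the major operating systems (for example, there is a PPA for Ubuntu).

Elasticsearch is probably still the one with the loudest community right now, but this looks quite interesting. For a project where you just want to set up something small to play with, it might be a fun option?

Attacks on Open Data

A while back I read "Membership Inference Attacks against Machine Learning Models", which attempts the following attack: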

[G]iven a data record and black-box access to a model, determine if the record was in the model's training dataset.

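In other words, given black-box access to a model trained on some dataset (Open Data, for example), they develop a method to determine whether a particular record was in that dataset. And the way to validate the attack is, of course, to attack real services and see how well it works: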

We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.

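It attacks a neural network with a neural network, and the current countermeasures are not easy to apply well, but using any of them does lower the attack's accuracy. The paper mentions four methods that make the attack harder: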

  • Restrict the prediction vector to top k classes.
  • Coarsen precision of the prediction vector.
  • Increase entropy of the prediction vector.
  • Use regularization.
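To make the first three items more concrete, here is a minimal sketch of how each could be applied to a model's output before it is returned to the caller. This is an illustration under my own assumptions, not the paper's code: the function names are hypothetical, renormalizing after the top-k cut is a design choice made here, and raising the softmax temperature is used as one way to increase entropy. Regularization happens at training time, so it is not shown.

```python
import math
from typing import List


def top_k(probs: List[float], k: int) -> List[float]:
    """Keep the k largest probabilities, zero the rest, then renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def coarsen(probs: List[float], digits: int) -> List[float]:
    """Round every probability to a fixed number of decimal digits."""
    return [round(p, digits) for p in probs]


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; a higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats, skipping zero entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Comparing `entropy(softmax_with_temperature(logits, 1.0))` with the same call at a higher temperature shows the prediction vector becoming less peaked, which is the information the attack relies on.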

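Another case worth reading about is the "AOL search data leak" from 2006. After the data was released, real users were identified from it, which caused quite a stir...

Watching videos directly from an IMDb ID

I saw what "Now Anyone Can Embed a Pirate Movie in a Website" introduces: enter an IMDb ID (including the tt prefix) and it automatically produces the embed code:

Then you can watch it directly online:

And it even supports subtitles (hmm):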

Interestingly, should one of those sources be Google Video, Vodlocker says its player offers Chromecast and subtitle support.

The official site says the sources are found by searching all over:

VoDLocker searches all general video hosters like youtube, google drive, openload...

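It looks like the whole stack is made of off-the-shelf technology. A search engine plus a periodic checking mechanism and a reporting mechanism would be enough to build it @_@

Google and Facebook are both building fact-checking systems

Google's is described in "Fact Check now available in Google Search and News around the world", and Facebook's in "Working to Stop Misinformation and False News".

Google gives suggestions in Search and News, confirmed through third-party websites, like this:

The underlying mechanism works through an open protocol: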

For publishers to be included in this feature, they must be using the Schema.org ClaimReview markup on the specific pages where they fact check public statements (documentation here), or they can use the Share the Facts widget developed by the Duke University Reporters Lab and Jigsaw.
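For a sense of what that markup looks like, here is a minimal sketch that builds a ClaimReview object as JSON-LD. The property names come from Schema.org, but the function name, the sample values and the 1 to 5 rating scale are my own assumptions, and which fields Google actually requires should be checked against its documentation.

```python
import json
from typing import Any, Dict


def build_claim_review(
    review_url: str,
    claim_text: str,
    claim_author: str,
    reviewer_name: str,
    rating_value: int,
    rating_label: str,
    date_published: str,
) -> Dict[str, Any]:
    """Build a Schema.org ClaimReview as a plain dict ready for JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "url": review_url,
        "datePublished": date_published,
        "claimReviewed": claim_text,
        "itemReviewed": {
            "@type": "Claim",
            "author": {"@type": "Organization", "name": claim_author},
        },
        "author": {"@type": "Organization", "name": reviewer_name},
        "reviewRating": {
            "@type": "Rating",
            "ratingValue": rating_value,
            "bestRating": 5,   # assumed scale: 5 = true
            "worstRating": 1,  # assumed scale: 1 = false
            "alternateName": rating_label,
        },
    }


def to_json_ld(claim_review: Dict[str, Any]) -> str:
    """Serialize the dict; the result would go in a script tag of type application/ld+json."""
    return json.dumps(claim_review, indent=2, ensure_ascii=False)
```

A publisher would embed the serialized string in the page that carries the fact check, one object per reviewed claim.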

But an algorithm also decides whether the publisher is authoritative enough:

Only publishers that are algorithmically determined to be an authoritative source of information will qualify for inclusion.

Facebook, on the other hand, judges news on the Timeline, but it does so through partners that work with Facebook, and it makes stories judged to be false show up less often:

We’ve started a program to work with independent third-party fact-checking organizations. We’ll use the reports from our community, along with other signals, to send stories to these organizations. If the fact-checking organizations identify a story as false, it will get flagged as disputed and there will be a link to a corresponding article explaining why. Stories that have been disputed also appear lower in News Feed.

I don't really like Facebook's approach; it is a roundabout way of controlling free speech (though this is nothing new).
