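Ways to enumerate hostnames under a domain...

The original article covers the groundwork for a penetration test, where you need to enumerate the hostnames under a specific domain: "A penetration tester’s guide to sub-domain enumeration".

The most direct method is still a DNS zone transfer (AXFR). If the administrator has not configured the DNS server properly, this is the fastest way. When that is not available, you have to fall back on various other ways of scanning.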
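
As one illustration of a fallback when AXFR is refused, here is a minimal sketch of wordlist-based enumeration: try candidate labels under the domain and keep the ones that resolve. This technique is my own choice of example and is not necessarily one the article lists. The function names, the candidate labels, and the `example.test` domain are all hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve_with_socket(hostname: str) -> Optional[str]:
    """Return an IPv4 address for hostname, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def enumerate_hosts(
    domain: str,
    labels: Iterable[str],
    resolver: Callable[[str], Optional[str]] = resolve_with_socket,
) -> Dict[str, str]:
    """Try each candidate label under domain and keep the names that resolve."""
    found = {}
    for label in labels:
        hostname = f"{label}.{domain}"
        address = resolver(hostname)
        if address is not None:
            found[hostname] = address
    return found


if __name__ == "__main__":
    # Hypothetical candidate labels; a real run would use a larger wordlist.
    print(enumerate_hosts("example.test", ["www", "mail", "dev"]))
```

The resolver is passed in as a parameter so the logic can be exercised without touching the network. Note that a wildcard DNS record would make every candidate appear to exist, so a real tool would need to detect that case first.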

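I skimmed it, and the article covers several approaches.

Someone should be able to take everything it mentions and turn it into a program XD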

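A full-text search engine on MySQL: Mroonga

Something I stumbled upon by accident, a full-text search engine on MySQL: "Mroonga".

It looks like it is mainly backed by the Japanese community? It has been in development since 2010, claims to support CJK, and has pre-built packages for the major operating systems (for example, a PPA on Ubuntu).

Elasticsearch probably still has the loudest community right now, but this looks quite interesting. For a project where you just want to set up something small to play with, it might be a fun option?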

Attack techniques against Open Data

A while ago I came across "Membership Inference Attacks against Machine Learning Models". The attack it tries to carry out:

[G]iven a data record and black-box access to a model, determine if the record was in the model's training dataset.

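In other words, you get access to a set of Open Data and then develop a method to determine whether a given record is in it. The way to validate the attack is, of course, to launch it and see how well it works: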

We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.

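It attacks a neural network with a neural network. The current defenses are not easy to get right either, but applying them does at least lower the attack's accuracy. The paper mentions four ways to make the attack harder: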

  • Restrict the prediction vector to top k classes.
  • Coarsen precision of the prediction vector.
  • Increase entropy of the prediction vector.
  • Use regularization.
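
The first three items act on the prediction vector the model returns, so they can be sketched as small post-processing functions. This is a minimal sketch under my own assumptions, not the paper's code: the function names are hypothetical, and softmax temperature is used here as one way to raise the entropy of the output. The fourth item, regularization, happens at training time and is not shown.

```python
import math
from typing import List


def restrict_to_top_k(probs: List[float], k: int) -> List[float]:
    """Keep only the k largest probabilities and zero out the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]


def coarsen(probs: List[float], digits: int) -> List[float]:
    """Round every probability to a fixed number of decimal digits."""
    return [round(p, digits) for p in probs]


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; a higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats, skipping zero entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.06, 0.04]
    print(restrict_to_top_k(probs, 2))
    print(coarsen(probs, 1))
    logits = [4.0, 1.0, 0.5]
    print(entropy(softmax_with_temperature(logits, 1.0)))
    print(entropy(softmax_with_temperature(logits, 5.0)))
```

Each function gives the attacker less information per query, at the cost of a less informative answer for legitimate users.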

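Another case worth reading about is the "AOL search data leak" of 2006. After the data was released, real users were identified from it, which caused quite a stir at the time...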

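Watching videos directly from an IMDb ID

I saw what "Now Anyone Can Embed a Pirate Movie in a Website" describes: enter an IMDb ID (the string that includes the leading tt) and it automatically produces the embed code.

You can then watch the video online directly.

It even supports subtitles (hmm):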

Interestingly, should one of those sources be Google Video, Vodlocker says its player offers Chromecast and subtitle support.

The official site says the sources are gathered from all over:

VoDLocker searches all general video hosters like youtube, google drive, openload...

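It looks like the whole technology stack is off the shelf. A search engine plus a periodic checking mechanism and a reporting mechanism is enough to build it @_@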

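Google and Facebook are both building news verification systems

Google's is described in "Fact Check now available in Google Search and News around the world", and Facebook's in "Working to Stop Misinformation and False News".

Google offers suggestions for Search and News, confirmed through third-party sites.

The mechanism behind it works through a public protocol: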

For publishers to be included in this feature, they must be using the Schema.org ClaimReview markup on the specific pages where they fact check public statements (documentation here), or they can use the Share the Facts widget developed by the Duke University Reporters Lab and Jigsaw.

But an algorithm also decides whether the organization providing the fact check is authoritative enough:

Only publishers that are algorithmically determined to be an authoritative source of information will qualify for inclusion.
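
To make the ClaimReview markup mentioned above more concrete, here is a minimal sketch that builds such a record as JSON-LD and wraps it in a script tag. The property names come from the Schema.org ClaimReview vocabulary as I understand it; the helper names, the 1 to 5 rating scale, and every value in the example are hypothetical.

```python
import json


def build_claim_review(claim, verdict, rating, review_url,
                       reviewer_name, claimant_name, date_published):
    """Build a Schema.org ClaimReview record as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "url": review_url,                # page that contains the fact check
        "datePublished": date_published,
        "claimReviewed": claim,           # short summary of the claim
        "author": {"@type": "Organization", "name": reviewer_name},
        "itemReviewed": {
            "@type": "Claim",
            "author": {"@type": "Person", "name": claimant_name},
        },
        "reviewRating": {
            "@type": "Rating",
            "ratingValue": rating,
            "bestRating": 5,              # assumed scale: 1 (false) to 5 (true)
            "worstRating": 1,
            "alternateName": verdict,     # human-readable verdict
        },
    }


def to_json_ld_script(record):
    """Serialize the record into a JSON-LD script tag for the page."""
    payload = json.dumps(record, ensure_ascii=False, indent=2)
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


if __name__ == "__main__":
    record = build_claim_review(
        claim="The moon is made of cheese",
        verdict="False",
        rating=1,
        review_url="https://factcheck.example/moon-cheese",
        reviewer_name="Example Fact Checkers",
        claimant_name="Some Speaker",
        date_published="2017-04-07",
    )
    print(to_json_ld_script(record))
```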

Facebook, on the other hand, evaluates the news on the Timeline, but the judging is done by partners working with Facebook, and stories judged to be false are shown less often:

We’ve started a program to work with independent third-party fact-checking organizations. We’ll use the reports from our community, along with other signals, to send stories to these organizations. If the fact-checking organizations identify a story as false, it will get flagged as disputed and there will be a link to a corresponding article explaining why. Stories that have been disputed also appear lower in News Feed.

I am not fond of Facebook's approach; it is a disguised way of controlling free speech (though that is nothing new).

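A fully distributed BitTorrent search engine

BitTorrent already has enough technology and ecosystem to build a search engine with a fully distributed architecture, similar to the search feature on the eDonkey Network (or rather on eMule, which later became the mainstream client), but I have never seen anything like it...

magnetico is one attempt; it obtains its results entirely by searching the DHT.

However, this software is licensed under the very aggressive AGPLv3, so consider it experimental. To become truly widespread, it would probably have to be built directly into the client, as with eMule...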

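Cloudbleed: Cloudflare's latest security incident

Cloudflare has laid out the full timeline and the scope of the impact: "Incident report on memory leak caused by Cloudflare parser bug".

It started on 2/18, when Google's Tavis Ormandy went straight to Twitter to look for someone at Cloudflare.

The write-up on Google's Project Zero: "cloudflare: Cloudflare Reverse Proxies are Dumping Uninitialized Memory".

The cause was a bug that sometimes sent out data that should not have been sent, possibly including sensitive data: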

It turned out that in some unusual circumstances, which I’ll detail below, our edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data.
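
The kind of bug the quote describes, running past the end of a buffer, can be illustrated with a toy simulation. This is not Cloudflare's code. It assumes a parser whose end-of-buffer check tests for equality, so a step that jumps over the end position is never caught. The function name, the skip rule for `<`, and the memory layout are all hypothetical.

```python
def scan(memory: bytes, start: int, end: int, strict_equality: bool) -> bytes:
    """Copy bytes from memory[start:] until the parser believes it reached end.

    In this toy parser a '<' byte also consumes the byte after it, so the
    cursor can jump from end - 1 to end + 1 without ever being equal to end.
    """
    out = bytearray()
    pos = start
    while pos < len(memory):
        if strict_equality:
            at_end = pos == end   # buggy: missed when pos jumps past end
        else:
            at_end = pos >= end   # safe: catches any position at or past end
        if at_end:
            break
        byte = memory[pos]
        out.append(byte)
        pos += 2 if byte == ord("<") else 1
    return bytes(out)


if __name__ == "__main__":
    request = b"<p>hi<"                  # buffer the parser is supposed to read
    neighbour = b"SECRET-COOKIE"         # unrelated data that happens to follow it
    memory = request + neighbour
    print(scan(memory, 0, len(request), strict_equality=False))
    print(scan(memory, 0, len(request), strict_equality=True))
```

With the equality check, the trailing `<` moves the cursor from position 5 to position 7, the check against 6 never fires, and the scan keeps returning bytes that belong to the neighbouring data.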

However, this does not include SSL keys, mainly because they were isolated:

For the avoidance of doubt, Cloudflare customer SSL private keys were not leaked. Cloudflare has always terminated SSL connections through an isolated instance of NGINX that was not affected by this bug.

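Because some of this sensitive data was even picked up by Google's search engine, this is quite serious. Cloudflare not only had to fix the bug but also had to work with many search engines to get the data removed: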

Because of the seriousness of such a bug, a cross-functional team from software engineering, infosec and operations formed in San Francisco and London to fully understand the underlying cause, to understand the effect of the memory leakage, and to work with Google and other search engines to remove any cached HTTP responses.

The period affected by the bug starts on 2016/09/22:

2016-09-22 Automatic HTTP Rewrites enabled
2017-01-30 Server-Side Excludes migrated to new parser
2017-02-13 Email Obfuscation partially migrated to new parser
2017-02-18 Google reports problem to Cloudflare and leak is stopped

Extrapolating from the traffic between 2/13 and 2/18, roughly 0.00003% of requests could trigger the problem:

The greatest period of impact was from February 13 and February 18 with around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulting in memory leakage (that’s about 0.00003% of requests).
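
The percentage in the quote follows directly from the "1 in every 3,300,000" figure, as this short calculation shows. The traffic volume used for the expected-count example is a made-up number for illustration only, not a Cloudflare statistic.

```python
from fractions import Fraction

# Rate quoted in the incident report: about 1 in every 3,300,000 requests.
LEAK_RATE = Fraction(1, 3_300_000)


def leak_rate_percent() -> float:
    """The quoted rate expressed as a percentage of requests."""
    return float(LEAK_RATE * 100)


def expected_leaks(total_requests: int) -> float:
    """Expected number of potentially leaking requests for a given volume."""
    return float(LEAK_RATE * total_requests)


if __name__ == "__main__":
    print(f"{leak_rate_percent():.5f}%")
    # Hypothetical volume, for illustration only.
    print(round(expected_leaks(1_000_000_000)))
```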

I have to say Tavis Ormandy is seriously impressive: without the source code or any help from Cloudflare, he came up with reproducible steps on his own:

I worked with cloudflare over the weekend to help clean up where I could. I've verified that the original reproduction steps I sent cloudflare no longer work.

The full timeline after the incident came to light:

2017-02-18 0011 Tweet from Tavis Ormandy asking for Cloudflare contact information
2017-02-18 0032 Cloudflare receives details of bug from Google
2017-02-18 0040 Cross functional team assembles in San Francisco
2017-02-18 0119 Email Obfuscation disabled worldwide
2017-02-18 0122 London team joins
2017-02-18 0424 Automatic HTTPS Rewrites disabled worldwide
2017-02-18 0722 Patch implementing kill switch for cf-html parser deployed worldwide
2017-02-20 2159 SAFE_CHAR fix deployed globally
2017-02-21 1803 Automatic HTTPS Rewrites, Server-Side Excludes and Email Obfuscation re-enabled worldwide

In addition, "List of Sites possibly affected by Cloudflare's #Cloudbleed HTTPS Traffic Leak" is a compiled list of the large sites that were affected (small sites are not listed).

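Does Google Search execute JavaScript...

A discussion of whether Google Search executes JavaScript. The article is half a year old, but the author recently found something new: "Does Google execute JavaScript?".

The author tested with several JavaScript snippets (see the original article) and found that execution is not guaranteed. The advice is therefore still to do server-side rendering to produce the content, which is also better for SEO on other search engines.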
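
As a minimal sketch of what server-side rendering means here, the function below produces the complete HTML on the server, so a crawler that never runs JavaScript still sees the content. The function name and page structure are hypothetical, and a real site would normally use a template engine rather than string formatting.

```python
from html import escape
from typing import List


def render_article(title: str, paragraphs: List[str]) -> str:
    """Return a full HTML page with the content already in the markup."""
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!doctype html>\n"
        "<html>\n"
        f"  <head><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        f"    <h1>{escape(title)}</h1>\n"
        f"{body}\n"
        "  </body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(render_article("Hello", ["Rendered on the server.", "No script needed."]))
```

The text is escaped before it is placed in the page, and the output contains no script element, so nothing depends on the crawler executing code.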
