Google 發表計算網頁真實性的演算法 (Knowledge-Based Trust)

Slashdot 上看到 Google 發表了計算網頁真實性的演算法,Knowledge-Based Trust (KBT):「Google Wants To Rank Websites Based On Facts Not Links」,原始的論文 PDF 檔案可以在「Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources」這邊取得。

論文本身的原理不難懂 (其實方法相當有趣),主要是給出了三個貢獻。

首先是能夠區分是取出資訊的方法有問題 (extract 的演算法不夠好),或是網站本身就給出錯誤的資訊:

Our main contribution is a more sophisticated probabilistic model, which can distinguish between two main sources of error: incorrect facts on a page, and incorrect extractions made by an extraction system.

第二個則是在效能上的改善:

Our second contribution is a new method to adaptively decide the granularity of sources to work with: if a specific webpage yields too few triples, we may aggregate it with other webpages from the same website. Conversely, if a website has too many triples, we may split it into smaller ones, to avoid computational bottlenecks (Section 4).

第三個則是提出好的分散式演算法,可以螞蟻雄兵計算出來:

The third contribution of this paper is a detailed, large-scale evaluation of the performance of our model.

KBT 並不是要取代 PageRank,而是跟 PageRank 互相配合,可以有效打擊內容農場 (Content farm) 這類網站,畢竟 PageRank 的假設在一般的狀況下是有邏輯的。

在「High PageRank but low KBT (top-left corner)」這段講到了這件事情:

We consider the 15 gossip websites listed in [16]. Among them, 14 have a PageRank among top 15% of the websites, since such websites are often popular. However, for all of them the KBT are in the bottom 50%; in other words, they are considered less trustworthy than half of the websites. Another kind of websites that often get low KBT are forum websites.

再找時間細讀其他類似的演算法...

Firefox 十週年

Mozilla 的 blog 上提到了 Firefox 十週年:「Celebrating 10 Years of Firefox」。

2002 年,Mozilla 推出了 Firebird,後來因為商標名稱的衝突,在 2004 年 11 月 9 日改名叫 Firefox,並推出 Firefox 1.0:「Mozilla Foundation releases the highly anticipated Mozilla Firefox 1.0 web browser」。

Firebird/Firefox 的出現改變了整個 web,讓人有所選擇,並且推動了 web 的發展。如果當年沒有 Firebird/Firefox,HTML5 這些新技術不知道要晚幾年才會出現。

Firefox 最新的版本增加了 Forget 功能,可以「遺忘」這五分鐘的資料,或是兩小時、24 小時:

這功能頗有趣的 XD

nginx 打算使用 JavaScript 的方法...

nginx 的創辦人在接受 InfoWorld 訪問時提到了打算使用 JavaScript 做為設定檔的計畫:「The company plans to let you use JavaScript as an application language in its eponymous Web server」。

We're planning JavaScript configurations, using JavaScript in [an] Nginx configuration. We plan to be more efficient on these [configurations], and we plan to develop a flexible application platform. You can use JavaScript snippets inside configurations to allow more flexible handling of requests, to filter responses, to modify responses. Also, eventually, JavaScript can be used as [an] application language for Nginx. Currently we have only Perl and Lua [supported in Nginx]. Perl is our own model, and Lua is a third-party model.

目前的設定檔算是 DSL?還蠻有趣的想法...

CloudFront 統計資料繼續改善...

半年多前 CloudFront 在 web console 上實做了統計資訊:「Amazon CloudFront 可以從 Web Console 上看到統計資料了」,而今天可以看到更多東西了:「CloudFront Update - Trends, Metrics, Charts, More Timely Logs」。

首先是確保收到 log 的時間差,目前可以確保一個小時內會收到 log:

With these changes, the newest log files in your bucket will reflect events that have happened as recently as an hour ago.

再來是多了 cache 分析:

以及熱門物件統計資訊:

也是其他 CDN 業者都已經有的功能,算是在補 :p

AWS 的 ELB 可以自訂 HTTP/HTTPS Timeout 時間了

Elastic Load Balancing 之前的 timeout 時間是預設值 60 秒,現在可以自訂時間了:「Elastic Load Balancing Connection Timeout Management」。

文章裡有提到好處:

Some applications can benefit from a longer timeout because they create a connection and leave it open for polling or extended sessions. Other applications tend to have short, non- recurring requests to AWS and the open connection will hardly ever end up being reused.

目前可以設定 1 秒到 3600 秒,預設值是 60 秒。

WebScaleSQL

WebScaleSQLFacebookGoogleLinkedIn 以及 Twitter 四家公司對 MySQL 5.6 的 fork。

Percona 的人也針對 WebScaleSQL 與 Percona Server 5.6 的比較,寫了一篇技術分析的文章:「A technical WebScaleSQL review and comparison with Percona Server」。

Percona 那篇分析文章提到不少改善屬於比較激進類型,對於 Percona Server 以及 MySQL 官方版本的定位並不適合。

而在開發上,語法也換到 C99C++11,也就是打算拋棄很舊的系統 (沒有 C99/C++11 compiler)。不過就這點來說,使得這上面的 patch 要 backport 回 MySQL 又增加了一些狀況...

WebScaleSQL 的出現代表 MySQL fork 又多一家,而且因為背後是這四家公司搞出來的,看起來聲勢會很浩大。進入「合久必分」的階段...