Google 與 Facebook 都在建立消息驗證系統

Google 的在「Fact Check now available in Google Search and News around the world」這,Facebook 的在「Working to Stop Misinformation and False News」這。

Google 是針對搜尋與新聞的部份給出建議,透過第三方的網站確認,像是這樣:

後面的機制是透過公開的協定進行:

For publishers to be included in this feature, they must be using the Schema.org ClaimReview markup on the specific pages where they fact check public statements (documentation here), or they can use the Share the Facts widget developed by the Duke University Reporters Lab and Jigsaw.

但也是透過演算法判斷提供的單位是否夠權威:

Only publishers that are algorithmically determined to be an authoritative source of information will qualify for inclusion.

而 Facebook 是針對 Timeline 上的新聞判斷,但是是透過與 Facebook 合作的 partner 判斷,而且會針對判斷為假的消息降低出現的機率:

We’ve started a program to work with independent third-party fact-checking organizations. We’ll use the reports from our community, along with other signals, to send stories to these organizations. If the fact-checking organizations identify a story as false, it will get flagged as disputed and there will be a link to a corresponding article explaining why. Stories that have been disputed also appear lower in News Feed.

我不是很喜歡 Facebook 的方法,變相的在控制言論自由 (不過也不是第一天了)。

Airbnb 被抓到操作站上資料以美化數據

在「How Airbnb's Data hid the Facts in New York City」這篇文章裡提到了 Airbnb 在去年 (2015 年) 十一月時操作站上資料,美化數據的證據。

Airbnb 在 2015 年 12 月時發表了一篇「Data on the Airbnb Community in NYC」,說明 Airbnb 對紐約地區的貢獻的種種之類的 PR 文章。

Airbnb 的文章裡面提到了資料是取自 2015 年 11 月 17 日的資料:

As of November 17, 2015 there were 35,966 active Airbnb listings in New York.

而作者則發現了 2015 年 11 月 17 日當天,Airbnb 站上的資料被「清理」過:

A major part of Airbnb's recent data release was a snapshot of New York City listings as of November 17, 2015. This report shows that the snapshot was photoshopped: in the days leading up to November 17, Airbnb ensured a flattering picture by carrying out a one-time targeted purge of more than 1,000 listings. The company then presented November 17 as a typical day in the company’s operations and mis-represented the one-time purge as a historical trend.

而且只針對紐約地區清理:

No similar event took place in other cities in North America or elsewhere.

完整的分析在「how_airbnbs_data_hid_the_facts_in_new_york_city.pdf」可以取得 PDF 檔,可以看到裡面同時有兩個不同資料來源的分析並確認 (Murray Cox 與 Tom Slee 所蒐集的資料)。

Google 發表計算網頁真實性的演算法 (Knowledge-Based Trust)

Slashdot 上看到 Google 發表了計算網頁真實性的演算法,Knowledge-Based Trust (KBT):「Google Wants To Rank Websites Based On Facts Not Links」,原始的論文 PDF 檔案可以在「Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources」這邊取得。

論文本身的原理不難懂 (其實方法相當有趣),主要是給出了三個貢獻。

首先是能夠區分是取出資訊的方法有問題 (extract 的演算法不夠好),或是網站本身就給出錯誤的資訊:

Our main contribution is a more sophisticated probabilistic model, which can distinguish between two main sources of error: incorrect facts on a page, and incorrect extractions made by an extraction system.

第二個則是在效能上的改善:

Our second contribution is a new method to adaptively decide the granularity of sources to work with: if a specific webpage yields too few triples, we may aggregate it with other webpages from the same website. Conversely, if a website has too many triples, we may split it into smaller ones, to avoid computational bottlenecks (Section 4).

第三個則是提出好的分散式演算法,可以螞蟻雄兵計算出來:

The third contribution of this paper is a detailed, large-scale evaluation of the performance of our model.

KBT 並不是要取代 PageRank,而是跟 PageRank 互相配合,可以有效打擊內容農場 (Content farm) 這類網站,畢竟 PageRank 的假設在一般的狀況下是有邏輯的。

在「High PageRank but low KBT (top-left corner)」這段講到了這件事情:

We consider the 15 gossip websites listed in [16]. Among them, 14 have a PageRank among top 15% of the websites, since such websites are often popular. However, for all of them the KBT are in the bottom 50%; in other words, they are considered less trustworthy than half of the websites. Another kind of websites that often get low KBT are forum websites.

再找時間細讀其他類似的演算法...