Home » Posts tagged "spam"

幹掉 Twitter 的推薦功能

Update:官方的「Show the best Tweets first」設定已經更新了,理論上你關掉後就不會出現這些 spam:

以下是原來的文章。


Twitter 預設是照時間排序,但是在這幾年就加入了「你可能會感興趣的 blah blah blah」之類的官方 spam,或是把 like 的插進來。

先前要幹掉這些功能頗麻煩的,我是看到就按不喜歡,然後大概就會有幾個禮拜不會出現,然後過一陣子後又冒出來...

最近有人找到方法可以一勞永逸的幹掉這些功能,找資料看起來是從這篇開始的,有人發現只要 mute 一些關鍵字就可以把這些功能從 timeline 上幹掉:

我找資料發現這些字其實是出現在 data attribute 上:

然後就有人開始翻開始整理了:

目前看起來效果頗不錯啊,這樣連手機上都可以擋掉這些 spam,頗不賴 :o

Hacker News 的潛規則

在「A List of Hacker News's Undocumented Features and Behaviors」這邊列了不少 Hacker News 的潛規則,看過後其實比較重要的是「當你需要自己實做一個類似的系統時,有哪些歷史教訓是人家已經走過的」。

像是 Anti-Voting Manipulation 與 Flame-War Detector 都是蠻常見的情境,Shadowbanning 則是防治廣告機制中比較軟性的一環。Green Usernames 也算是軟性的機制...

另外產品面上,Hacker News 也設計一些常見的 list 讓使用者除了首頁以外的選擇。

GitHub 的 Gist 要移除匿名發表的功能了...

GitHubGist 變成要註冊使用者才能貼了:「Deprecation notice: Removing anonymous gist creation」。主要的原因也還是因為太多 spam 之類的訊息:

In 30 days, we'll be deprecating anonymous gist creation—a decision we made after a lot of deliberation. Anonymous gists are a handy tool for quickly putting a code snippet online, but as the only way to create anonymous content on GitHub, they also see a large volume of spam. In addition, many people already have a combination of tools authenticated with GitHub that allow them to create gists they own.

預定是 3/19 關閉... 只好繼續貼 Pastebin 了... XD

Amazon SES 提供 Dedicated IP Address 的養 IP 機制...

透過混搭本來的 shared IP address (已經對各 ISP 有信用) 與客戶租用的 dedicated IP address 混發,然後一段時間後養起來 (最多 45 days):「Amazon SES Can Now Automatically Warm Up Your Dedicated IP Addresses」。

Amazon SES controls the amount of email that can be sent through an IP address. Amazon SES uses a predefined warm-up plan that indicates the maximum number of emails that can be sent daily through an IP address to ensure the traffic is increasing gradually over 45 days.

要注意的是,這個過程需要發五萬封才有辦法養出來,不是設上去就會自己養

After you successfully warm up your dedicated IPs (either by yourself or by using the Amazon SES automatic warm-up mechanism), you must send at least 50,000 emails per dedicated IP per day so that the IPs maintain a positive reputation with ISPs. To avoid throttling by the ISPs, avoid sending a high volume of emails soon after the completion of warm-up; we recommend gradually increasing the volume for better deliverability.

現在弄 mail service 超麻煩的...

自建 Mail System 的難度

Hacker News 上的「Ask HN: Is it possible to run your own mail server for personal use?」這篇道出了現在自建 mail system 的難度。作者遇到信件常常被各大 mail 服務歸類成 spam:

The problem is making sure my mail is not marked as spam by the major MTAs out there, gmail and hotmail both mark my mails as spam.

整理一下現在自己建 mail system 要做到哪些事情:

  • 確認 IP (包括 IPv4/IPv6) 沒有列入任何 Open Relay 清單中。
  • 確認 IP 的反解可以查出對應的正解。
  • 確認 SPF 設定。
  • 確認送出去的信件有 DKIM 簽名,而且 DNS 也有設上對應的設定。
  • 確認 TLS 的發送與接收都正常。
  • 確認 DMARC 機制正確運作。

如同「Exercising Software Freedom in the Global Email System」這邊講的,現在要自己搞 mail system 超累...

Google 發表計算網頁真實性的演算法 (Knowledge-Based Trust)

Slashdot 上看到 Google 發表了計算網頁真實性的演算法,Knowledge-Based Trust (KBT):「Google Wants To Rank Websites Based On Facts Not Links」,原始的論文 PDF 檔案可以在「Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources」這邊取得。

論文本身的原理不難懂 (其實方法相當有趣),主要是給出了三個貢獻。

首先是能夠區分是取出資訊的方法有問題 (extract 的演算法不夠好),或是網站本身就給出錯誤的資訊:

Our main contribution is a more sophisticated probabilistic model, which can distinguish between two main sources of error: incorrect facts on a page, and incorrect extractions made by an extraction system.

第二個則是在效能上的改善:

Our second contribution is a new method to adaptively decide the granularity of sources to work with: if a specific webpage yields too few triples, we may aggregate it with other webpages from the same website. Conversely, if a website has too many triples, we may split it into smaller ones, to avoid computational bottlenecks (Section 4).

第三個則是提出好的分散式演算法,可以螞蟻雄兵計算出來:

The third contribution of this paper is a detailed, large-scale evaluation of the performance of our model.

KBT 並不是要取代 PageRank,而是跟 PageRank 互相配合,可以有效打擊內容農場 (Content farm) 這類網站,畢竟 PageRank 的假設在一般的狀況下是有邏輯的。

在「High PageRank but low KBT (top-left corner)」這段講到了這件事情:

We consider the 15 gossip websites listed in [16]. Among them, 14 have a PageRank among top 15% of the websites, since such websites are often popular. However, for all of them the KBT are in the bottom 50%; in other words, they are considered less trustworthy than half of the websites. Another kind of websites that often get low KBT are forum websites.

再找時間細讀其他類似的演算法...

把舊站的 comment 關掉...

這幾天 blog 常常出現 504,連到機器上發現 MySQL 很忙,才發現是舊站 spam 的量太大的問題造成的,把舊站的 comment 關掉後就好不少,現在看機器的 CPU loading 應該是正常多了。

新站已經是用 DISQUS 所以比較沒有 comment spam 的問題。舊站那邊除了關掉 comment 外,另外直接進 MySQL 把 2006 年之後的 comment 都丟到 pending 裡面去。(因為 2005 年 8 月後就沒更新舊站了)

來研究看看要怎麼把舊的 comment 丟進 Akismet...

Update:發現只要按下 Check for Spam 的按鈕就會把目前 Pending comments 都丟去 Akismet 掃,先丟著 :p

Archives