微軟的 Outlook 系統會自動點擊信件內的連結

前幾天在 Hacker News Daily 上翻到的,微軟的 Outlook 系統 (雲端上的系統) 會自動點擊信件內的連結,導致一堆問題:「“Magic links” can end up in Bing search results — rendering them useless.」,在 Hacker News 上的討論也有很多受害者出來抱怨:「“Magic links” can end up in Bing search results, rendering them useless (medium.com/ryanbadger)」。

原文的標題寫的更批評,指控 Outlook 會把這些 link 丟到 Bing 裡面 index,這點還沒有看到確切的證據。

先回到連結被點擊的問題,照文章內引用的資料來看,看起來是 2017 年開始就有的情況:「Do any common email clients pre-fetch links rather than images?」。

As of Feb 2017 Outlook (https://outlook.live.com/) scans emails arriving in your inbox and it sends all found URLs to Bing, to be indexed by Bing crawler.

在 Hacker News 上的討論也提到了像是 one-time login email 的機制也會因此受到影響,被迫要用比較費工夫的方法讓使用者登入 (像是給使用者 one-time code 輸入,而不是點 link 就可以登入)。

先記起來,以後在設計時應該會遇到,要重新思考 threat model...

curl 的 TLS fingerprint 偽裝專案 curl-impersonate 支援 Chrome 了

先前在「修正 Curl 的 TLS handshake,避開 bot 偵測機制」這邊提到 curl-impersonate 這個專案,試著修改 curl 的 TLS handshake fingerprint 讓偽裝的更好,本來只支援 Firefox,現在則是支援 Google Chrome 的 fingerprint 了...

作者寫的兩篇說明文章也可以看看:「Making curl impersonate Firefox」、「Impersonating Chrome, too」。

看起來愈來愈完整了,連 LD_PRELOAD 的用法也出現了,然後在 Arch Linux 上也出現 AUR 可以用了...

繞過 Web 上「防機器人」機制的資料

這兩天的 Hacker News 冒出一些討論在講 Web 上「防機器人」機制要怎麼繞過:

第一篇主要是從各種面向都一起討論,從大方向的分類討論 (「Where to begin building undetectable bot?」),另外介紹目前有哪些產品 (在「List of anti-bot software providers」這邊)。

在文章裡有提到一個有意思的工具「puppeteer-extra-plugin-stealth」,主要是在 Node.js 類的環境,查了一下在 Python 上也有 pyppeteer-stealth,不過 Python 版本直接講了不完美 XDDD

Transplanted from puppeteer-extra-plugin-stealth, Not perfect.

第二篇文章在開頭就提到他不是很愛 Proxy,因為 Proxy 很容易偵測。在文章最後面則是提到了兩個方案,第一個是用大量便宜的 Android 手機加上 Data SIM 來跑,另外一個是直接用 Android 模擬器加上 4G 網卡跑。

依照這些想法,好像可以來改善一下手上的 RSS 工具...

利用 Cloudflare Workers 繞過 Cloudflare 自家的阻擋機制

Hacker News 首頁上看到「How to bypass Cloudflare bot protection (jychp.medium.com)」這則,裡面的文章是「How to bypass CloudFlare bot protection ?」這篇,利用 Cloudflare Workers 繞過 Cloudflare 自家的 CAPTCHA 機制。

這個漏洞有先被送給 Cloudflare,但被認為不是問題,所以作者就決定公開:

Several months ago I submitted what appeared to be a security flaw to CloudFalre’s bugbounty program. According to them, this is not a problem, it’s up to you to make up your own mind.

技術上就是透過 Cloudflare Workers 當作 proxy server,只是看起來 Cloudflare 對自家 IP 有特別處理,在設定妥當後,用 Cloudflare Workers 的 IP address 去連 Cloudflare 的站台,幾乎不會觸發 Cloudflare 的阻擋機制。

不過 free tier 還是有限制,主要就是數量:

The first 100,000 requests each day are free and paid plans start at just $5/10 million requests, making Workers as much as ten-times less expensive than other serverless platforms.


So let’s enjoy the 100 000 request/day for your free Cloudflare account and go scrape the world !


蘋果也搞了個 Applebot 掃資料

Hacker News Daily 上翻到的:「About Applebot」,另外 Hacker News 上的討論也蠻有趣的,可以看看:「Applebot (support.apple.com)」。

目前的用途是用在 Siri 之類的 bot:

Applebot is the web crawler for Apple. Products like Siri and Spotlight Suggestions use Applebot.

裡面有提到辨識方式,IP 會使用,反解會是 *.applebot.apple.com

Traffic coming from Applebot is identified by its user agent, and reverse DNS shows it in the *.applebot.apple.com domain, originating from the net block.

另外 User-agent 也可以看出:

Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko) Version/Safari_version Safari/WebKit_version (Applebot/Applebot_version)

後面有提到 search engine 的部份 (About search rankings),這點讓大家在猜蘋果是不是開始在弄 search engine 了,在 Hacker News 上的討論串裡面可以看到不少對於蘋果自己搞 search engine 的猜測。

然後也有些頗有趣的,像是爆料當初開發的過程遇到的問題 XD

jd20 3 days ago [–]

Some fun facts:
- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing it's Go origins to the world, which Russ Cox fixed the next day).

- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

- In it's first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do it's own DNS resolution and caching, fond memories...

Source: I worked on the original version.

Cloudflare 改用自己的 CAPTCHA 服務 hCaptcha

CloudflareGooglereCAPTCHA 改用自家的 hCaptcha:「Moving from reCAPTCHA to hCaptcha」。

看起來其實就是錢的問題,reCAPTCHA 要收費了,而以 Cloudflare 的量會太貴:

Earlier this year, Google informed us that they were going to begin charging for reCAPTCHA. That is entirely within their right. Cloudflare, given our volume, no doubt imposed significant costs on the reCAPTCHA service, even for Google.

另外 hCaptcha 有提供免費版本給一般網站用,剛出來這幾天等白老鼠寫心得後,再決定要不要跳進去測試...

cdnjs 轉移到 Cloudflare 負責維護

不確定是 cdnjs 還是 CDNJS,因為官方網站是小寫,但 GitHub 上是大寫...

Anyway,cdnjs 本來由社群維護更新 (實際上是透過 bot 更新,但 bot 本身也需要維護),因為人力時間的因素,轉移給 Cloudflare 負責了:「An Update on CDNJS」。

這次也更新了 cdnjs 的 daily request 數量,可以看到現在大約是每天六十億次:

本來 Cloudflare 是站在贊助頻寬的角色提供服務:

Within Cloudflare’s infrastructure there is a set of machines which are responsible for pulling the latest version of the repo periodically. Those machines then become the origin for cdnjs.cloudflare.com, with Cloudflare’s Global Load Balancer automatically handling failures. Cloudflare’s cache automatically stores copies of many of the projects making it possible for us to deliver them quickly from all 195 of our data centers.

但更新的 bot 本身掛了,而且維護者沒時間修:

Unfortunately approximately thirty days ago one of those bots stopped working, preventing updated projects from appearing in CDNJS. The bot's open-source maintainer was not able to invest the time necessary to keep the bot running. After several weeks we were asked by the community and the CDNJS founders to take over maintenance of the CDNJS repo itself.

所以現在則是 Cloudflare 接手維護了:

This means the Cloudflare engineering team is taking responsibility for keeping the contents of github.com/cdnjs/cdnjs up to date, in addition to ensuring it is correctly served on cdnjs.cloudflare.com.

不過裡面也提到了一個問題,就是現在瀏覽器為了安全性,對於不同的站台會有不同的 cache,本來 cdnjs 的設計目的之一被大幅削弱,現在只剩下省頻寬了:

The future value of CDNJS is now in doubt, as web browsers are beginning to use a separate cache for every website you visit. It is currently used on such a wide swath of the web, however, it is unlikely it will be disappearing any time soon.

用 PoW 當作防機器人的方式

看到「wehatecaptchas」這個服務試著用 PoW (Proof of work) 擋機器人...

這個方式不需要像 GooglereCAPTCHA 那樣蒐集大量行為 (對隱私不利),也不需要解一堆奇怪的圖片問題。

CAPTCHA 最常用的領域,也就是擋 spam 這件事情來說,PoW 這樣的單一方式應該是不夠,但可以當作綜合方法裡面的一種...

AI 版的星海爭霸二將直接透過歐洲區的 Battle.net 匿名與人類對戰

前幾天 Blizzard 公佈的消息,DeepMind 的星海爭霸二 AI (AlphaStar) 將會透過 Blizzard 的 Battle.net 歐洲區伺服器跟人類對戰:「DeepMind Research on Ladder」。

Experimental versions of DeepMind’s StarCraft II agent, AlphaStar, will soon play a small number of games on the competitive ladder in Europe as part of ongoing research into AI.


If you would like the chance to help DeepMind with its research by matching against AlphaStar, you can opt in by clicking the “opt-in” button on the in-game popup window. You can alter your opt-in selection at any time by using the “DeepMind opt-in” button on the 1v1 Versus menu.

但你仍然不會知道對手是人還是 AI,而且如同一般對戰情況,這會影響到你的戰績:

For scientific test purposes, DeepMind will be benchmarking AlphaStar’s performance by playing anonymously during a series of blind trial matches. This means the StarCraft community will not know which matches AlphaStar is playing, to help ensure that all games are played under the same conditions. AlphaStar plays with built-in restrictions that the DeepMind team has defined in consultation with pro players. A win or a loss against AlphaStar will affect your MMR as normal.


加州從今年七月開始,禁止 AI 偽裝成人類 (前幾天也有一些新聞在報導):「A California law now means chatbots have to disclose they’re not human」,對應的法條在「Bill Text - SB-1001 Bots: disclosure」這邊可以看到:

17941. (a) It shall be unlawful for any person to use a bot to communicate or interact with another person in California online, with the intent to mislead the other person about its artificial identity for the purpose of knowingly deceiving the person about the content of the communication in order to incentivize a purchase or sale of goods or services in a commercial transaction or to influence a vote in an election. A person using a bot shall not be liable under this section if the person discloses that it is a bot.

(b) The disclosure required by this section shall be clear, conspicuous, and reasonably designed to inform persons with whom the bot communicates or interacts that it is a bot.

而加州是 Blizzard Entertainment 的總部...

法條上面對「online platform」有設計排除條款,不過如果只算星海二的人數,有可能不到這個豁免限制... 所以得避開而改用歐洲區來測試?

(c) “Online platform” means any public-facing Internet Web site, Web application, or digital application, including a social network or publication, that has 10,000,000 or more unique monthly United States visitors or users for a majority of months during the preceding 12 months.

(c) This chapter does not impose a duty on service providers of online platforms, including, but not limited to, Web hosting and Internet service providers.

美國軍方應該是超級關注這個議題,相較於 AlphaGo 或是 AlphaZero 是資訊完全透明的遊戲,這次要踏入非對稱資訊的遊戲。

如果在這個領域上有成果的話,可以預期未來的戰爭 (yeah 實體戰爭) 會開始大量採用 AI 了...

reCAPTCHA 與語音辨識:以子之矛,攻子之盾

GooglereCAPTCHA 大概是目前最常用的反制機器人工具了,但因為 accessibility 的原因 (而且應該是有法令要求),還是需要提供盲人可以存取的方式,也就是以語音判斷是否是機器人。

unCaptcha2 就是用這塊,加上 Google 自家的語音辨識 API (也支援其他家 API) 可以直接打穿:

現有的程式碼已經先被 Google 反制,但目的是展示出這樣的概念。