蘋果也搞了個 Applebot 掃資料

Hacker News Daily 上翻到的:「About Applebot」,另外 Hacker News 上的討論也蠻有趣的,可以看看:「Applebot (support.apple.com)」。

目前的用途是用在 Siri 之類的 bot:

Applebot is the web crawler for Apple. Products like Siri and Spotlight Suggestions use Applebot.

裡面有提到辨識方式,IP 會使用,反解會是 *.applebot.apple.com

Traffic coming from Applebot is identified by its user agent, and reverse DNS shows it in the *.applebot.apple.com domain, originating from the net block.

另外 User-agent 也可以看出:

Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko) Version/Safari_version Safari/WebKit_version (Applebot/Applebot_version)

後面有提到 search engine 的部份 (About search rankings),這點讓大家在猜蘋果是不是開始在弄 search engine 了,在 Hacker News 上的討論串裡面可以看到不少對於蘋果自己搞 search engine 的猜測。

然後也有些頗有趣的,像是爆料當初開發的過程遇到的問題 XD

jd20 3 days ago [–]

Some fun facts:
- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing it's Go origins to the world, which Russ Cox fixed the next day).

- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

- In it's first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do it's own DNS resolution and caching, fond memories...

Source: I worked on the original version.

Cloudflare 改用自己的 CAPTCHA 服務 hCaptcha

CloudflareGooglereCAPTCHA 改用自家的 hCaptcha:「Moving from reCAPTCHA to hCaptcha」。

看起來其實就是錢的問題,reCAPTCHA 要收費了,而以 Cloudflare 的量會太貴:

Earlier this year, Google informed us that they were going to begin charging for reCAPTCHA. That is entirely within their right. Cloudflare, given our volume, no doubt imposed significant costs on the reCAPTCHA service, even for Google.

另外 hCaptcha 有提供免費版本給一般網站用,剛出來這幾天等白老鼠寫心得後,再決定要不要跳進去測試...

cdnjs 轉移到 Cloudflare 負責維護

不確定是 cdnjs 還是 CDNJS,因為官方網站是小寫,但 GitHub 上是大寫...

Anyway,cdnjs 本來由社群維護更新 (實際上是透過 bot 更新,但 bot 本身也需要維護),因為人力時間的因素,轉移給 Cloudflare 負責了:「An Update on CDNJS」。

這次也更新了 cdnjs 的 daily request 數量,可以看到現在大約是每天六十億次:

本來 Cloudflare 是站在贊助頻寬的角色提供服務:

Within Cloudflare’s infrastructure there is a set of machines which are responsible for pulling the latest version of the repo periodically. Those machines then become the origin for cdnjs.cloudflare.com, with Cloudflare’s Global Load Balancer automatically handling failures. Cloudflare’s cache automatically stores copies of many of the projects making it possible for us to deliver them quickly from all 195 of our data centers.

但更新的 bot 本身掛了,而且維護者沒時間修:

Unfortunately approximately thirty days ago one of those bots stopped working, preventing updated projects from appearing in CDNJS. The bot's open-source maintainer was not able to invest the time necessary to keep the bot running. After several weeks we were asked by the community and the CDNJS founders to take over maintenance of the CDNJS repo itself.

所以現在則是 Cloudflare 接手維護了:

This means the Cloudflare engineering team is taking responsibility for keeping the contents of github.com/cdnjs/cdnjs up to date, in addition to ensuring it is correctly served on cdnjs.cloudflare.com.

不過裡面也提到了一個問題,就是現在瀏覽器為了安全性,對於不同的站台會有不同的 cache,本來 cdnjs 的設計目的之一被大幅削弱,現在只剩下省頻寬了:

The future value of CDNJS is now in doubt, as web browsers are beginning to use a separate cache for every website you visit. It is currently used on such a wide swath of the web, however, it is unlikely it will be disappearing any time soon.

用 PoW 當作防機器人的方式

看到「wehatecaptchas」這個服務試著用 PoW (Proof of work) 擋機器人...

這個方式不需要像 GooglereCAPTCHA 那樣蒐集大量行為 (對隱私不利),也不需要解一堆奇怪的圖片問題。

CAPTCHA 最常用的領域,也就是擋 spam 這件事情來說,PoW 這樣的單一方式應該是不夠,但可以當作綜合方法裡面的一種...

AI 版的星海爭霸二將直接透過歐洲區的 Battle.net 匿名與人類對戰

前幾天 Blizzard 公佈的消息,DeepMind 的星海爭霸二 AI (AlphaStar) 將會透過 Blizzard 的 Battle.net 歐洲區伺服器跟人類對戰:「DeepMind Research on Ladder」。

Experimental versions of DeepMind’s StarCraft II agent, AlphaStar, will soon play a small number of games on the competitive ladder in Europe as part of ongoing research into AI.


If you would like the chance to help DeepMind with its research by matching against AlphaStar, you can opt in by clicking the “opt-in” button on the in-game popup window. You can alter your opt-in selection at any time by using the “DeepMind opt-in” button on the 1v1 Versus menu.

但你仍然不會知道對手是人還是 AI,而且如同一般對戰情況,這會影響到你的戰績:

For scientific test purposes, DeepMind will be benchmarking AlphaStar’s performance by playing anonymously during a series of blind trial matches. This means the StarCraft community will not know which matches AlphaStar is playing, to help ensure that all games are played under the same conditions. AlphaStar plays with built-in restrictions that the DeepMind team has defined in consultation with pro players. A win or a loss against AlphaStar will affect your MMR as normal.


加州從今年七月開始,禁止 AI 偽裝成人類 (前幾天也有一些新聞在報導):「A California law now means chatbots have to disclose they’re not human」,對應的法條在「Bill Text - SB-1001 Bots: disclosure」這邊可以看到:

17941. (a) It shall be unlawful for any person to use a bot to communicate or interact with another person in California online, with the intent to mislead the other person about its artificial identity for the purpose of knowingly deceiving the person about the content of the communication in order to incentivize a purchase or sale of goods or services in a commercial transaction or to influence a vote in an election. A person using a bot shall not be liable under this section if the person discloses that it is a bot.

(b) The disclosure required by this section shall be clear, conspicuous, and reasonably designed to inform persons with whom the bot communicates or interacts that it is a bot.

而加州是 Blizzard Entertainment 的總部...

法條上面對「online platform」有設計排除條款,不過如果只算星海二的人數,有可能不到這個豁免限制... 所以得避開而改用歐洲區來測試?

(c) “Online platform” means any public-facing Internet Web site, Web application, or digital application, including a social network or publication, that has 10,000,000 or more unique monthly United States visitors or users for a majority of months during the preceding 12 months.

(c) This chapter does not impose a duty on service providers of online platforms, including, but not limited to, Web hosting and Internet service providers.

美國軍方應該是超級關注這個議題,相較於 AlphaGo 或是 AlphaZero 是資訊完全透明的遊戲,這次要踏入非對稱資訊的遊戲。

如果在這個領域上有成果的話,可以預期未來的戰爭 (yeah 實體戰爭) 會開始大量採用 AI 了...

reCAPTCHA 與語音辨識:以子之矛,攻子之盾

GooglereCAPTCHA 大概是目前最常用的反制機器人工具了,但因為 accessibility 的原因 (而且應該是有法令要求),還是需要提供盲人可以存取的方式,也就是以語音判斷是否是機器人。

unCaptcha2 就是用這塊,加上 Google 自家的語音辨識 API (也支援其他家 API) 可以直接打穿:

現有的程式碼已經先被 Google 反制,但目的是展示出這樣的概念。

Googlebot 的 Math.random()

Hacker News Daily 上看到「Googlebot’s Javascript random() function is deterministic」這則有趣的發現。作者發現 Googlebot 的 Math.random() 並不隨機,甚至是固定的:

The first time Googlebot calls Math.random() the result will always be 0.14881141134537756, the second call will always be 0.19426893815398216. The script I linked to above simply uses this fact but obfuscates it a little and ‘seed’ it with something that doesn’t look too arbitrary.

需要無法預測的 random number (有安全性需求的) 應該用 RandomSource.getRandomValues() 這類函數,而不是用 Math.random(),所以這點倒是還好...

Googlebot 的 Web rendering service 的細節

在「Polymer 2 and Googlebot」這邊文章裡面才看到 Google 官方在今年八月就有公開 Googlebot 所使用的 Web rendering service (WRS) 的細節:「Rendering on Google Search」。可以想像到是基於 Google Chrome 的修改:

Googlebot uses a web rendering service (WRS) that is based on Chrome 41 (M41). Generally, WRS supports the same web platform features and capabilities that the Chrome version it uses — for a full list refer to chromestatus.com, or use the compare function on caniuse.com.

裡面提到一些值得注意的事情,像是不支援 WebSocket,所以對於考慮 Google 搜尋結果的頁面來說,就要注意錯誤處理了...

Facebook 也參戰參與 StarCraft 的 AI 測試,只是成績不太好...

Facebook 也參與了 StarCraft 的 AI 測試,不過成績不太好:「Facebook Quietly Enters StarCraft War for AI Bots, and Loses」。

不過跟 DeepMind 投入的項目不太一樣,Facebook 投入的是 StarCraft,DeepMind 投入的是 StarCraft II...

比賽有二十八組,Facebook 拿下第六名,但前三名都只是獨立愛好者:

Final results released Sunday indicate Facebook still has a way to go: CherryPi finished sixth in a field of 28; the top three bots were all made by lone, hobbyist coders.


分析 FCC 對網路中立性的留言,將鄉民與機器人分開來分析

Boing Boing 的立場其實還蠻鮮明的,所以有時候他們的新聞看看就好...

但這篇真的很有趣,把 FCC (美國的聯邦通信委員會) 上兩千兩百萬則對網路中立性的留言拿出來分析,結果發現真人與機器人的差異超明顯 XDDD:「Analysis of 22 million FCC comments show that humans love Net Neutrality and bots really, really hate it」,引用的文章在「Discovering truth through lies on the internet - FCC comments analyzed」這邊。

分析可以發現,真人偏好網路中立,而機器人反對網路中立 XDDD:(其實大家心裡都有底是怎麼玩出來的... 只是這次有機會分析,讓事情更明顯)

Data analysis company Gravwell ingested 22,000,000 comments sent to the FCC's docket on Net Neutrality and posted their preliminary findings, which are that the majority of comments came from bots, and these bots oppose Net Neutrality; of the comments that appear to originate with humans, the vast majority favor Net Neutrality.

文章中列出幾個有趣的現象,像是機器人的 comment 大量重複:

A very small minority of comments are unique -- only 17.4% of the 22,152,276 total. The highest occurrence of a single comment was over 1 million.

甚至是 pornhub.com 的郵件位置 XDDD:

Most comments were submitted in bulk and many come in batches with obviously incorrect information -- over 1,000,000 comments in July claimed to have a pornhub.com email address

然後機器人的 pattern 也很容易辨別:

Bot herders can be observed launching the bots -- there are submissions from people living in the state of "{STATE}" that happen minutes before a large number of comment submissions

這個有點類似「50 Cent Party (五毛黨)」,但是是自動化機器人,而且產出的「品質」不太好 XDDD