蘋果也搞了個 Applebot 掃資料

Hacker News Daily 上翻到的:「About Applebot」,另外 Hacker News 上的討論也蠻有趣的,可以看看:「Applebot (support.apple.com)」。

目前的用途是用在 Siri 之類的 bot:

Applebot is the web crawler for Apple. Products like Siri and Spotlight Suggestions use Applebot.

裡面有提到辨識方式,IP 會使用 17.0.0.0/8,反解會是 *.applebot.apple.com

Traffic coming from Applebot is identified by its user agent, and reverse DNS shows it in the *.applebot.apple.com domain, originating from the 17.0.0.0 net block.

另外 User-agent 也可以看出:

Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko) Version/Safari_version Safari/WebKit_version (Applebot/Applebot_version)

後面有提到 search engine 的部份 (About search rankings),這點讓大家在猜蘋果是不是開始在弄 search engine 了,在 Hacker News 上的討論串裡面可以看到不少對於蘋果自己搞 search engine 的猜測。

然後也有些頗有趣的,像是爆料當初開發的過程遇到的問題 XD

jd20 3 days ago [–]

Some fun facts:
- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing it's Go origins to the world, which Russ Cox fixed the next day).

- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

- In it's first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do it's own DNS resolution and caching, fond memories...

Source: I worked on the original version.

MariaDB 的 S3 Engine 效能測試

PerconaMariaDB 在 10.5 (目前的最新穩定版) 裡出的 S3 Engine 給出了簡單的測試報告:「MariaDB S3 Engine: Implementation and Benchmarking」。

這個 engine 顧名思義就是把資料丟到 Amazon S3 上,目前是 alpha 版本,預設是不會載入的,需要開 alpha flag 才能用:

The S3 engine is READ_ONLY so you can’t perform any write operations ( INSERT/UPDATE/DELETE ), but you can change the table structure.

另外這是從 Aria 改出來的 read-only engine,而 Aria 是從 MyISAM 改出來的:

The S3 storage engine is based on the Aria code and the main feature is that you can directly move your table from a local device to S3 using ALTER.

測出來發現在 read-only 的情境下,COUNT(*) 超快,看起來就是跟 MyISAM 體系有關,直接撈 MyISAM 內的資料,所以本地要 18 秒,但放到 S3 反而秒殺 XDDD

整體看起來還不錯?算是一種 Data warehouse 的方案,主要是要用到 row-based format 儲存的優點,遇到一些冷資料可以這樣玩。

從「Using the S3 Storage Engine」這邊的設定方式看到 s3_host_name,看起來有機會接其他家的 S3 API,或是本地的 Storage。

話說 Aria 這個引擎當初最主要的重點就在 crash-safe,在有了 crash-safe 之後,DRBD 這種 block-level replication 機制就可以硬幹上去,後來主力就在擴充其他型態了,像是 GIS 與 virtual column 的功能,不過這些功能本家在 InnoDB 上好像也都陸陸續續跟上來了,單純的 Aria engine 好像還好...

Google 的搜尋廣告改版造成的混淆

Google 的搜尋廣告最近改版了,在 The Verge 的「Google’s ads just look like search results now」這邊可以看到報導以及 screenshot:

可以看到廣告的標示變成 favicon 了,使得使用者更容易誤會是搜尋內容。而這也使得廣告的點閱比例大幅提昇,像是「Google’s latest search results change further blurs what’s an ad」這邊提到的:

For all four clients (a local health care company, two business-to-business companies and an e-commerce company), the desktop click-through rates increased and ranged from 4% to 10.5%. All clients had slight declines in the click-through rates on mobile devices.

The Verge 後續也分析了這個改變帶來的反思:「How much longer will we trust Google’s search results?」。

我的建議是 uBlock Origin 當作基本工具 (在各瀏覽器上應該都有支援),另外進階一些可以用 DuckDuckGo 看看,但不保證搜尋品質會讓你滿意...

MySQL (InnoDB) 的內部狀態

Percona 老大 Peter Zaitsev 在「MySQL – A Series of Bad Design Decisions」這篇裡提到了他認為 MySQL 設計上的問題,不過裡面也提到了不少有用的指令,平常可以先熟悉一下輸出,等真的有狀況的時候才會想起來可以用這些指令。

首先是最經典的 SHOW ENGINE INNODB STATUS,算是很多文件上面都會提到的指令。可以看 InnoDB 當下的情況,藉以猜測內部現在是怎麼卡住...

另外一個是 SHOW ENGINE INNODB MUTEX,就如同 Peter Zaitsev 所提到的,這個指令想辦法抓出最重要的資訊,但不要像 SHOW ENGINE INNODB STATUS 給了那麼多。

另外當然就是 INFORMATION_SCHEMA,他甚至希望 SHOW ENGINE INNODB 系列的指令應該要被整合進去,這樣才能用 SELECT 相關的指令整理... (因為 ORDER BY 以及蠻多的指令沒辦法在 SHOW ... 上面用)

企業內的文件搜尋系統 Amazon Kendra

AWS 推出了具有語意分析的能力,可以直接丟自然語言進去搜尋的 Amazon Kendra:「Announcing Amazon Kendra: Reinventing Enterprise Search with Machine Learning」。

之前 Google 有推出過 Google Search Appliance 也是做企業內資料的整合 (2016 年收掉了),但應該沒有到可以用自然語言搜尋?

Amazon Kendra 的費用不算便宜,Enterprise Edition 提供 150GB 的容量與 50 萬筆文件,然後提供大約 40k query/day,這樣要 USD$7/hr,一個月大約是 USD$5,040,不過對於企業來說應該是很有用...

另外有提到這邊 query 收費的部份是估算,會依照 query 問題的難易度而不同:

Actual queries per day will vary based on query complexity, which greatly varies from customer to customer. Less complex queries (e.g. “leave policy”) consume less resources to run, and more complex queries (e.g. “What’s the daily parking allowance in Seattle?”) consume more resources to run. The total number of queries you can run with your allocated resources will depend on your mix of queries. The max queries per day provided above is an estimate, assuming 80% less complex queries and 20% more complex queries.

這樣頗有趣的,感覺可以處理簡單的分析了?

Amazon Elasticsearch Service 可以利用 S3 當作二級儲存空間了

Amazon Elasticsearch Service 的新功能,使用 Amazon S3 當作第二級儲存空間 (UltraWarm):「Announcing UltraWarm (Preview) for Amazon Elasticsearch Service」。

UltraWarm 需要不同的機器 (跑不同版本?),機器的規格 (vCPU 與記憶體的比率) 接近 Memory Optimized 的版本,但是貴了不少,所以需要夠大的資料量才會打平回來...

us-east-1 來看,SSD EBS 的空間成本就是 USD$0.135/GB,而傳統磁性硬碟是 USD$0.067/GB (不知道收不收 I/O 費用?),但 storage 的價錢是 USD$0.024/GB。這邊值得一提的是 Amazon S3 是 USD$0.023/GB,看起來是直接包括了 API 的呼叫費用?

Startpage 被廣告公司收購

Hacker News 上看到 Reddit 上的消息 (看起來有陣子了):「Startpage is now owned by an advertising company」。

Startpage 算是之前有在用的 default search engine,但發現有很多 bug 後就不太用了。目前還是先設 DuckDuckGo,然後在需要的時候用之前寫的 press-g-to-google-duckduckgo 切到 Google 去找...

DuckDuckGo 還是有搜尋品質的問題...

hiQ 爬 LinkedIn 資料的無罪判決

hiQ 之前爬 LinkedIn 的公開資料而被 LinkedIn 告 (可以參考 2017 時的「hiQ prevails / LinkedIn must allow scraping / Of your page info」),這場官司一路打官司打到第九巡迴庭,最後的判決確認了 LinkedIn 完全敗訴。判決書在「HIQ LABS V. LINKEDIN」這邊可以看到。

這次的判決書有提到當初地方法院有下令 LinkedIn 不得用任何方式設限抓取公開資料:

The district court granted hiQ’s motion. It ordered LinkedIn to withdraw its cease-and-desist letter, to remove any existing technical barriers to hiQ’s access to public profiles, and to refrain from putting in place any legal or technical measures with the effect of blocking hiQ’s access to public profiles. LinkedIn timely appealed.

而在判決書裡其他地方也可以看到巡迴庭不斷確認地方法院當時的判決是合理的,並且否定 LinkedIn 的辯解:(這邊只拉了兩段,裡面還有提到很多次)

In short, the district court did not abuse its discretion in concluding on the preliminary injunction record that hiQ currently has no viable way to remain in business other than using LinkedIn public profile data for its Keeper and Skill Mapper services, and that HiQ therefore has demonstrated a likelihood of irreparable harm absent a preliminary injunction.

We conclude that the district court’s determination that the balance of hardships tips sharply in hiQ’s favor is not “illogical, implausible, or without support in the record.” Kelly, 878 F.3d at 713.

到巡迴庭差不多是確定的判決了,沒有其他特別的流程的話...

Trac 1.4

剛剛看到 Trac 出了 1.4 的消息,雖然還是在用 Python 2:「Release Notes for Trac 1.4 Jinja Release」。

其中一個比較大的改變是 template engine 換掉了,從 Edgewall 自家的 Genshi 換到 Jinja2 上,目前看起來兩者都可以用,但可以預期現有的 template 要找時間換過去:「5 times speed-up when rendering query results, thanks to the migration from Genshi to Jinja2.」。

其他的改變應該還好,找個週末來把自己的 Trac 備份起來測試好了...

GCE 的 IP 要收費了...

收到信件通知,本來在 GCE 上使用的 Public IP address 是免費的,2020 年開始變成要收 USD$0.004/hr (Standard,約 USD$2.88/month) 或是 USD$0.002/hr (Preemptible,約 USD$1.44/month):

First, we’re increasing the price for Google Compute Engine (GCE) VMs that use external IP addresses. Beginning January 1, 2020, a standard GCE instance using an external IP address will cost an additional $0.004/hr and a preemptible GCE instance using an external IP address will cost an additional $0.002/hr.

從 2020 年一月開始生效,但是前三個月會用 100% discount 的方式呈現在帳單上 (所以還是免費),這樣你會知道你的 IP address 費用會吃多少錢:

We will fully discount any external IP usage for the first 3 months to help you quantify the impact of these pricing changes. Please take note of the following dates:

January 1, 2020: Although your invoice will show your calculated external IP-related charges, these will be fully discounted and you will not need to pay these.
April 1, 2020: You will need to pay for any incurred external IP-related charges shown on your invoice.

其實整體成本應該是還好,但看到漲價總是不開心... XD