AWS steps up and decides to keep working on Elasticsearch

I previously wrote about "Elasticsearch and Kibana have also become non-open-source software". Afterwards, Elastic's CEO (and founder) published "Amazon: NOT OK - why we had to change Elastic licensing", directly criticizing AWS.

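Now AWS has come out with a statement of its own, which is basically a press release as well: "Stepping up for a truly open source Elasticsearch". The gist is that AWS will keep maintaining its own version under the original Apache License, Version 2.0, and it also argues that what Elastic said is not accurate...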

For now the two sides are still trading statements; let's see what updates come after a while...

Elasticsearch and Kibana have also become non-open-source software

Saw this on Nuzzel: Elasticsearch and Kibana have also become non-open-source software: "Elasticsearch and Kibana are now business risks". The official announcement is at "Upcoming licensing changes to Elasticsearch and Kibana".

New versions will be dual-licensed under the SSPL (a license designed by MongoDB) and the Elastic License (Elastic's commercial license), but neither of the two is an open source license.

This is probably more or less related to offerings like Amazon Elasticsearch Service? I don't know how AWS will handle this going forward...

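Also, if you don't go with Elasticsearch, it seems Solr is the only usable alternative right now? But I haven't looked at Solr in a long time, so I don't know how far the software has come...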

Apple has also built a crawler, Applebot, to collect data

Found via Hacker News Daily: "About Applebot". The discussion on Hacker News is also quite interesting and worth a look: "Applebot (support.apple.com)".

It is currently used by products such as Siri:

Applebot is the web crawler for Apple. Products like Siri and Spotlight Suggestions use Applebot.

The page describes how to identify it: the IP will be in 17.0.0.0/8, and the reverse DNS lookup will resolve to *.applebot.apple.com:

Traffic coming from Applebot is identified by its user agent, and reverse DNS shows it in the *.applebot.apple.com domain, originating from the 17.0.0.0 net block.
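The two checks described in the quote can be sketched roughly as follows. This is only a minimal illustration, not anything Apple publishes: the function names and the sample hostname are made up for the example, and `reverse_lookup` simply wraps the standard library's `socket.gethostbyaddr`.

```python
import ipaddress
import socket

# Values taken from the Apple support page quoted above.
APPLE_NET = ipaddress.ip_network("17.0.0.0/8")
APPLEBOT_SUFFIX = ".applebot.apple.com"


def looks_like_applebot(ip: str, reverse_name: str) -> bool:
    """Return True if the IP is in 17.0.0.0/8 and the PTR name is under applebot.apple.com."""
    try:
        in_block = ipaddress.ip_address(ip) in APPLE_NET
    except ValueError:
        # Not a parseable IP address at all.
        return False
    host = reverse_name.rstrip(".").lower()
    return in_block and host.endswith(APPLEBOT_SUFFIX)


def reverse_lookup(ip: str) -> str:
    """Hypothetical helper: fetch the PTR name for an IP (performs a real DNS query)."""
    return socket.gethostbyaddr(ip)[0]
```

In a log-processing script this could be called as `looks_like_applebot(ip, reverse_lookup(ip))`; the pure function is kept separate from the DNS query so it can be checked without network access.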

The User-agent also identifies it:

Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko) Version/Safari_version Safari/WebKit_version (Applebot/Applebot_version)
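As a rough sketch, the trailing `(Applebot/Applebot_version)` token in that template can be picked out with a regular expression. The pattern and the function name here are my own assumptions based only on the template above, and since a user agent string is trivially spoofable, this should be treated as a hint rather than proof.

```python
import re
from typing import Optional

# Matches the "(Applebot/<version>" token from the template above.
# The version stops at whitespace, ";" or ")" so extra fields after it are tolerated.
APPLEBOT_UA = re.compile(r"\(Applebot/(?P<version>[^)\s;]+)")


def applebot_version(user_agent: str) -> Optional[str]:
    """Return the Applebot version found in a User-agent string, or None."""
    match = APPLEBOT_UA.search(user_agent)
    return match.group("version") if match else None
```

For example, `applebot_version(request_headers.get("User-Agent", ""))` would return a version string for requests that claim to be Applebot and `None` for everything else (`request_headers` being whatever mapping your server framework provides).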

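Further down there is a section related to search engines (About search rankings), which has people guessing whether Apple has started building a search engine; the Hacker News thread has plenty of speculation about Apple building its own.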

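There are also some rather amusing bits, such as an insider account of the problems encountered during the original development XD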

jd20 3 days ago [–]

Some fun facts:
- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing it's Go origins to the world, which Russ Cox fixed the next day).

- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

- In it's first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do it's own DNS resolution and caching, fond memories...

Source: I worked on the original version.

Brave files a complaint that Google is not complying with the GDPR

Brave (a browser forked from Chromium) has filed a complaint that Google is not complying with the GDPR: "Formal GDPR complaint against Google’s internal data free-for-all".

The main issue is "purpose limitation", which comes from "REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 27 April 2016":

1. Personal data shall be:

(b) collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes; further processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes shall, in accordance with Article 89(1), not be considered to be incompatible with the initial purposes (‘purpose limitation’);

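The key words are "specified" and "explicit": the GDPR requires that purposes be stated clearly, whereas the "Purported processing purpose" entries in the compiled document "Inside the black box" show a large number of extremely broad descriptions.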

Google will probably push back on this point and argue that such descriptions are sufficient; it's up to the EU to decide what to do...

Confusion caused by the redesign of Google's search ads

Google recently redesigned its search ads; The Verge's "Google’s ads just look like search results now" has a report along with screenshots.

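In the screenshots you can see that the ad label now looks like a favicon, which makes it easier for users to mistake ads for search results. It has also pushed ad click-through rates up noticeably, as mentioned in "Google’s latest search results change further blurs what’s an ad":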

For all four clients (a local health care company, two business-to-business companies and an e-commerce company), the desktop click-through rates increased and ranged from 4% to 10.5%. All clients had slight declines in the click-through rates on mobile devices.

The Verge later followed up with a reflection on what this change means: "How much longer will we trust Google’s search results?".

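My suggestion is to use uBlock Origin as a basic tool (it should be supported on all the major browsers). As a further step you could try DuckDuckGo, though there's no guarantee the search quality will satisfy you...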

Avast and Jumpshot sell users' browsing history and behavior

It's been a while since this came out, so I can pull the material together and write it down...

For coverage, see PCMag's "The Cost of Avast's Free Antivirus: Companies Can Spy on Your Clicks" and Motherboard (VICE)'s "Leaked Documents Expose the Secretive Market for Your Web Browsing Data". The summary states the main point up front, that Avast is selling users' browsing history and behavior:

Avast is harvesting users' browser histories on the pretext that the data has been 'de-identified,' thus protecting your privacy. But the data, which is being sold to third parties, can be linked back to people's real identities, exposing every click and search they've made.

Avast uses its free antivirus software to collect users' browsing history and behavior, then sells it through its subsidiary Jumpshot:

The Avast division charged with selling the data is Jumpshot, a company subsidiary that's been offering access to user traffic from 100 million devices, including PCs and phones.

A textbook case of "free is the most expensive". The other interesting part is the question of who the data was sold to:

Who else might have access to Jumpshot's data remains unclear. The company's website says it's worked with other brands, including IBM, Microsoft, and Google. However, Microsoft said it has no current relationship with Jumpshot. IBM, on the other hand, has "no record" of being a client of either Avast or Jumpshot. Google did not respond to a request for comment.

Microsoft says there is "no current relationship", IBM says it has "no record" of being a client, and Google did not respond.

The article then explains what the data looks like and how it can be used:

For instance, a single click can theoretically look like this:

Device ID: abc123x Date: 2019/12/01 Hour Minute Second: 12:03:05 Domain: Amazon.com Product: Apple iPad Pro 10.5 - 2017 Model - 256GB, Rose Gold Behavior: Add to Cart

At first glance, the click looks harmless. You can't pin it to an exact user. That is, unless you're Amazon.com, which could easily figure out which Amazon user bought an iPad Pro at 12:03:05 on Dec. 1, 2019. Suddenly, device ID: 123abcx is a known user. And whatever else Jumpshot has on 123abcx's activity—from other e-commerce purchases to Google searches—is no longer anonymous.

So if Google had this data, it could cross-reference it with its own logs and obtain a complete record of the user.
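To make the cross-referencing argument concrete, here is a small sketch of how a "de-identified" device ID could be tied back to an account. Everything in it is hypothetical (the field names, the account, and the second and third clickstream rows); only the first record mirrors the example quoted from the article.

```python
from collections import defaultdict

# Hypothetical "de-identified" clickstream rows, keyed only by device ID.
clickstream = [
    {"device_id": "abc123x", "ts": "2019-12-01 12:03:05", "domain": "amazon.com",
     "product": "Apple iPad Pro 10.5", "behavior": "Add to Cart"},
    {"device_id": "abc123x", "ts": "2019-12-01 12:10:41", "domain": "google.com",
     "product": None, "behavior": "Search"},
    {"device_id": "zzz999y", "ts": "2019-12-01 12:03:07", "domain": "amazon.com",
     "product": "USB cable", "behavior": "Add to Cart"},
]

# Hypothetical first-party log held by the site itself, keyed by account.
first_party_log = [
    {"account": "alice@example.com", "ts": "2019-12-01 12:03:05",
     "product": "Apple iPad Pro 10.5", "event": "Add to Cart"},
]


def link_devices(clicks, own_log, domain):
    """Map device IDs to accounts by matching (timestamp, product, action) on one domain."""
    index = {(r["ts"], r["product"], r["event"]): r["account"] for r in own_log}
    linked = {}
    for click in clicks:
        if click["domain"] != domain:
            continue
        account = index.get((click["ts"], click["product"], click["behavior"]))
        if account is not None:
            linked[click["device_id"]] = account
    return linked


def history_for(clicks, linked):
    """Collect every clickstream row, on any domain, for devices that were linked."""
    history = defaultdict(list)
    for click in clicks:
        account = linked.get(click["device_id"])
        if account is not None:
            history[account].append(click)
    return dict(history)


linked = link_devices(clickstream, first_party_log, "amazon.com")
history = history_for(clickstream, linked)
```

A single exact match on one site is enough to attach the device's activity on every other site to a known account, which is the point the article is making.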

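Not long after the news broke, Avast announced it was shutting down Jumpshot. Even the response to getting caught was remarkably smooth, as if it had been rehearsed: "A message from Avast CEO Ondrej Vlcek".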

Looking into it, Avast also owns AVG, as well as a VPN service...

Amazon Kendra, a document search system for enterprises

AWS has launched Amazon Kendra, which has semantic analysis capabilities and lets you search by feeding in natural language directly: "Announcing Amazon Kendra: Reinventing Enterprise Search with Machine Learning".

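Google previously offered the Google Search Appliance, which also integrated data inside an enterprise (it was discontinued in 2016), but it probably couldn't handle natural-language search?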

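Amazon Kendra isn't cheap: the Enterprise Edition provides 150GB of storage and 500,000 documents, plus roughly 40k queries/day, for USD$7/hr, which comes to about USD$5,040 a month (24 hours x 30 days). Still, it should be quite useful for enterprises...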
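The monthly figure is just the hourly price multiplied out, which can be checked in a couple of lines. The 30-day month is an assumption on my part, and the per-query number is only a derived estimate that assumes the full 40k queries/day are actually used.

```python
HOURLY_USD = 7            # Enterprise Edition price quoted above
QUERIES_PER_DAY = 40_000  # estimated maximum quoted above
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30       # assumption: a 30-day month


def monthly_cost_usd(hourly: float = HOURLY_USD, days: int = DAYS_PER_MONTH) -> float:
    """Cost of keeping the index running around the clock for one month."""
    return hourly * HOURS_PER_DAY * days


def cost_per_query_usd(hourly: float = HOURLY_USD,
                       queries_per_day: int = QUERIES_PER_DAY) -> float:
    """Effective per-query cost if the daily query estimate is fully used."""
    return (hourly * HOURS_PER_DAY) / queries_per_day


monthly = monthly_cost_usd()      # 7 * 24 * 30
per_query = cost_per_query_usd()  # 168 / 40,000
```

Because the instance is billed by the hour whether or not anyone searches, the effective per-query cost rises as usage falls below the estimate.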

It also notes that the query figure in the pricing is an estimate, and that it varies with how hard the queries are:

Actual queries per day will vary based on query complexity, which greatly varies from customer to customer. Less complex queries (e.g. “leave policy”) consume less resources to run, and more complex queries (e.g. “What’s the daily parking allowance in Seattle?”) consume more resources to run. The total number of queries you can run with your allocated resources will depend on your mix of queries. The max queries per day provided above is an estimate, assuming 80% less complex queries and 20% more complex queries.

That's quite interesting; it sounds like it can already handle simple analysis?

Amazon Elasticsearch Service can now use S3 as second-tier storage

A new feature of Amazon Elasticsearch Service: using Amazon S3 as a second storage tier (UltraWarm): "Announcing UltraWarm (Preview) for Amazon Elasticsearch Service".

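UltraWarm requires a different kind of instance (running a different version?). The instance specs (the vCPU-to-memory ratio) are close to the Memory Optimized family but considerably more expensive, so you need a large enough amount of data to break even...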

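Looking at us-east-1, SSD EBS storage costs USD$0.135/GB and traditional magnetic disks cost USD$0.067/GB (not sure whether I/O is billed separately?), while UltraWarm storage is priced at USD$0.024/GB. It's worth noting that Amazon S3 itself is USD$0.023/GB, so it looks like the API call costs are simply bundled in?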
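A rough way to see where the break-even point lands is to divide the extra instance cost by the per-GB storage saving. The sketch below assumes the quoted prices are per GB per month; the instance premium is a purely hypothetical input, since no instance prices are quoted here.

```python
# Storage prices quoted above for us-east-1 (assumed to be USD per GB per month).
PRICE_PER_GB = {
    "ebs_ssd": 0.135,
    "ebs_magnetic": 0.067,
    "ultrawarm": 0.024,
    "s3": 0.023,
}


def monthly_storage_usd(tier: str, gigabytes: float) -> float:
    """Storage-only cost for one month on the given tier."""
    return PRICE_PER_GB[tier] * gigabytes


def break_even_gb(instance_premium_usd: float, hot_tier: str = "ebs_ssd") -> float:
    """Data size at which UltraWarm's storage saving offsets the pricier instances.

    instance_premium_usd is a hypothetical figure: how much more per month the
    UltraWarm instances cost than the instances they would replace.
    """
    saving_per_gb = PRICE_PER_GB[hot_tier] - PRICE_PER_GB["ultrawarm"]
    return instance_premium_usd / saving_per_gb
```

For instance, `break_even_gb(111.0)` works out to about 1,000 GB against SSD EBS, because the saving is USD$0.111 per GB; against magnetic disks the saving is only USD$0.043 per GB, so the break-even size is correspondingly larger.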

Google Search can no longer be used with Lynx or w3m

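I came across "No more google for console junkies", which mentions that the new version of Google can no longer be operated with Lynx. I tested with w3m and found it doesn't work either: searches return results, but following the links now depends on JavaScript, and neither browser supports JavaScript, so you get stuck...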

We've gone from the early-days concept of Unobtrusive JavaScript to an era where nothing works without JavaScript...

I did find some terminal web browsers with JavaScript support (Links, ELinks), but they are only experimental, and even typing Chinese is a problem... :/