hiQ 爬 LinkedIn 資料的無罪判決

hiQ 之前爬 LinkedIn 的公開資料而被 LinkedIn 告 (可以參考 2017 時的「hiQ prevails / LinkedIn must allow scraping / Of your page info」),這場官司一路打官司打到第九巡迴庭,最後的判決確認了 LinkedIn 完全敗訴。判決書在「HIQ LABS V. LINKEDIN」這邊可以看到。

這次的判決書有提到當初地方法院有下令 LinkedIn 不得用任何方式設限抓取公開資料:

The district court granted hiQ’s motion. It ordered LinkedIn to withdraw its cease-and-desist letter, to remove any existing technical barriers to hiQ’s access to public profiles, and to refrain from putting in place any legal or technical measures with the effect of blocking hiQ’s access to public profiles. LinkedIn timely appealed.

而在判決書裡其他地方也可以看到巡迴庭不斷確認地方法院當時的判決是合理的,並且否定 LinkedIn 的辯解:(這邊只拉了兩段,裡面還有提到很多次)

In short, the district court did not abuse its discretion in concluding on the preliminary injunction record that hiQ currently has no viable way to remain in business other than using LinkedIn public profile data for its Keeper and Skill Mapper services, and that HiQ therefore has demonstrated a likelihood of irreparable harm absent a preliminary injunction.

We conclude that the district court’s determination that the balance of hardships tips sharply in hiQ’s favor is not “illogical, implausible, or without support in the record.” Kelly, 878 F.3d at 713.

到巡迴庭差不多是確定的判決了,沒有其他特別的流程的話...

引用自己論文的問題...

Nature 上點出來期刊論文裡自我引用的問題 (這邊的自我引用包括了合作過的人):「Hundreds of extreme self-citing scientists revealed in new database」。

開頭舉了一個極端的例子,Vaidyanathan 的自我引用比率高達 94%,而學界的中位數是 12.7%,感覺是有某種制度造成的行為?

Vaidyanathan, a computer scientist at the Vel Tech R&D Institute of Technology, a privately run institute, is an extreme example: he has received 94% of his citations from himself or his co-authors up to 2017, according to a study in PLoS Biology this month. He is not alone. The data set, which lists around 100,000 researchers, shows that at least 250 scientists have amassed more than 50% of their citations from themselves or their co-authors, while the median self-citation rate is 12.7%.

會想要提是因為想到當年 Google 的經典演算法 PageRank,就是在處理這個問題... 把 paper 換成 webpage 而已。

Facebook 修正錯字的新演算法

先前 Facebook 已經先發表過 fastText 了,在這個月的月初又發表了另外一個演算法 Misspelling Oblivious Embeddings (MOE),是搭著本來的 fastText 而得到的改善:「A new model for word embeddings that are resilient to misspellings」。

Facebook 的說明提到在 user-generated text 的內容上,MOE 的效果比 fastText 好:

We checked the effectiveness of this approach considering different intrinsic and extrinsic tasks, and found that MOE outperforms fastText for user-generated text.

論文發表在 arXiv 上:「Misspelling Oblivious Word Embeddings」。

依照介紹,fastText 的重點在於 semantic loss,而 MOE 則多了 spell correction loss:

The loss function of fastText aims to more closely embed words that occur in the same context. We call this semantic loss. In addition to the semantic loss, MOE also considers an additional supervisedloss that we call spell correction loss. The spell correction loss aims to embed misspellings close to their correct versions by minimizing the weighted sum of semantic loss and spell correction loss.

不過目前 GitHub 上的 facebookresearch/moe 只有放 dataset,沒有 open source 出來讓人直接用,可能得自己刻...

透過 Avast 防毒軟體蒐集資料的 Jumpshot

看到「Less than Half of Google Searches Now Result in a Click」這篇,在說明 Google 的搜尋結果頁面內的行為大幅偏頗 Google 自家服務的問題,這個問題最近幾個禮拜開始紅了起來...

但另外一點值得注意的是裡面提到 Jumpshot 這個服務可以分析使用者的頁面以及行為這件事情...

在 2013 年 Avast 買下 Jumpshot:「AVAST Software Acquires Jumpshot to Work Magic Against Slow PC Performance」,當時的目標是效能:

Having served as PC tech consultants to their friends and family, their goal was to build a product to help less tech-savvy PC users optimize and tune up their PC performance, cleaning it from unpleasant toolbars and junk software.

但在 2015 年的時候就可以看到 Avast 在他們自家的論壇上有說明,Avast 會收資料丟進 Jumpshot:「Avast and Jumpshot」。

These aggregated results are the only thing that Avast makes available to Jumpshot customers and end users.

而藉由這些資料而提供服務。

Amazon 的 Elasticsearch 服務提供十四天免費 hourly snapshot

Amazon Elasticsearch Service 提供 14 天免費的 hourly snapshot:「Amazon Elasticsearch Service increases data protection with automated hourly snapshots at no extra charge」。

Amazon Elasticsearch Service has increased its snapshot frequency from daily to hourly, providing more granular recovery points. If you need to restore your cluster, you now have numerous, recent snapshots to choose from. These automated snapshots are retained for 14 days at no extra charge.

不過這是 5.3+ 版本才有,舊版只有 daily:

  • For domains running Elasticsearch 5.3 and later, Amazon ES takes hourly automated snapshots and retains up to 336 of them for 14 days.
  • For domains running Elasticsearch 5.1 and earlier, Amazon ES takes daily automated snapshots (during the hour you specify) and retains up to 14 of them for 30 days.

In both cases, the service stores the snapshots in a preconfigured Amazon S3 bucket at no additional charge. You can use these automated snapshots to restore domains.

算是方便管理...

robots.txt 的標準化

雖然聽起來有點詭異,但 robots.txt 的確一直都只是業界慣用標準,而非正式標準,所以各家搜尋引擎加加減減都有一些自己的參數。

在經過這麼久以後,Google 決定推動 robots.txt 的標準化:「Formalizing the Robots Exclusion Protocol Specification」,同時 Google 也放出了他們解讀 robots.txt 的 parser:「Google's robots.txt Parser is Now Open Source」,在 GitHubgoogle/robotstxt 這邊可以取得。

目前的 draft 是 00 版,可以在 draft-rep-wg-topic-00 這邊看到,不知道其他搜尋引擎會給什麼樣的回饋...

Elasticsearch 提供免費版本的安全功能

Elasticsearch 決定將基本的安全功能從付費功能轉為免費釋出,很明顯的是受到 Open Distro for Elasticsearch 的壓力而做出的改變:「Security for Elasticsearch is now free」。

要注意的是這不是 open source 版本,只是將這些功能放到 basic tier 裡讓使用者免費使用:

Previously, these core security features required a paid Gold subscription. Now they are free as a part of the Basic tier. Note that our advanced security features — from single sign-on and Active Directory/LDAP authentication to field- and document-level security — remain paid features.

這代表 Open Distro for Elasticsearch 提供的還是比較多:

With Open Distro for Elasticsearch, you can leverage your existing authentication infrastructure such as LDAP/Active Directory, SAML, Kerberos, JSON web tokens, TLS certificates, and Proxy authentication/SSO for user authentication. An internal user repository with support for basic HTTP authentication is also avaliable for easy setup and evaluation.

Granular, role-based access control enables you to control the actions a user can perform on your Elasticsearch cluster. Roles control cluster operations, access to indices, and even the fields and documents users can access. Open Distro for Elasticsearch also supports multi-tenant environments, allowing multiple teams to share the same cluster while only being able to access their team's data and dashboards.

目前看起來還是可以朝 Open Distro for Elasticsearch 靠過去...

改 Open Distro for Elasticsearch 預設密碼的方式

AWS 弄出來的 Open Distro for Elasticsearch 因為內建了安全性的功能 (參考「AWS 對 Elastic Stack 實作免費的開源版本 Open Distro for Elasticsearch」),應該是目前新架設 Elasticsearch 的首選。

不過裝好後預設有五個帳號,但從 Open Distro 的 Kibana 介面無法修改改其中兩個使用者的密碼 (adminkibanaserver),要修改密碼發現得花不少功夫,不知道為什麼要這樣設計 :/

這邊講的是透過 RPM (以及 deb) 的方式的修改方式,如果是 Docker 的方式請參考後面提到在 AWS blog 上的文章:「Change your Admin Passwords in Open Distro for Elasticsearch」。

首先先透過 hash.sh 產生 bcrypt 的 hash,像是這樣 (輸入 password 當密碼):

bash /usr/share/elasticsearch/plugins/opendistro_security/tools/hash.sh
WARNING: JAVA_HOME not set, will use /usr/bin/java
[Password:]
$2y$12$QchkpgY8y/.0TL7wruWq4egBDrlpIaURiBYKoZD50G/twdylgwuhG

然後修改 /usr/share/elasticsearch/plugins/opendistro_security/securityconfig/internal_users.yml 檔案裡面的值,順便改掉 readonly 的部分。

接下來是把這個 internal_users.yml 檔案的設定更新到 Elasticsearch 裡。由於這邊需要讀 /etc/elasticsearch/ 的東西,所以偷懶用 root 跑:

sudo bash /usr/share/elasticsearch/plugins/opendistro_security/tools/securityadmin.sh -cd ../securityconfig/ -icl -nhnv -cacert /etc/elasticsearch/root-ca.pem -cert /etc/elasticsearch/kirk.pem -key /etc/elasticsearch/kirk-key.pem

做完後可能要重跑 Elasticsearch 與 Kibana:

sudo service elasticsearch restart
sudo service kibana restart

或是重開機... 順便測試看看重開後有沒有生效。理論上修改完成後,就是用新的帳號密碼連到 Kibana。

上面的方法是參考了「Default Password Reset」(先找到這篇) 與「change admin password」(後來在 AWS blog 的文章上發現的 GitHub issue 連結) 這邊的資訊。

官方的說明文件則是在寫這篇文章時才找到的,平常搜尋時不太會出現:「Apply configuration changes」。

Googlebot 會用新版的 Chrome 跑 JavaScript 了

Googlebot 先前一直都是用 Chrome 41 版的引擎在跑,如果要考慮 SEO (for JavaScript),就得確認網站在 Chrome 41 上面可以執行 (於是 ES6 就...)。

今天從 John ResigTwitter 上看到 Googlebot 更新了引擎:「The new evergreen Googlebot」。

這樣針對 JS 的 SEO 省了不少事情...

Open Distro for Elasticsearch 的比較

先前提到的「AWS 對 Elastic Stack 實作免費的開源版本 Open Distro for Elasticsearch」,在「Open Distro for Elasticsearch Review」這邊有整理了一份重點:

可以看到主要重點都在安全性那塊...