A Dataset of Cat Meows

On Hacker News Daily I saw "CatMeows: A Publicly-Available Dataset of Cat Vocalizations", a dataset of cat meows... The corresponding discussion is at "CatMeows: A Publicly-Available Dataset of Cat Vocalizations (2020) (zenodo.org)".

It covers three kinds of sounds XDDD

  1. Brushing - Cats were brushed by their owners in their home environment for a maximum of 5 minutes;
  2. Isolation in an unfamiliar environment - Cats were transferred by their owners into an unfamiliar environment (e.g., a room in a different apartment or an office). Distance was minimized and the usual transportation routine was adopted so as to avoid discomfort to animals. The journey lasted less than 30 minutes and cats were allowed 30 minutes with their owners to recover from transportation, before being isolated in the unfamiliar environment, where they stayed alone for maximum 5 minutes;
  3. Waiting for food - The owner started the routine operations that preceded food delivery in the usual environment the cat was familiar with. Food was given at most 5 minutes after the beginning of the experiment.

Brushing, anxiety, and waiting to be fed? It's not hard to imagine what this dataset could be used for...
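If you want to poke at it, here is a minimal sketch that groups the recordings by context. It assumes the archive has been extracted locally and that each file name starts with a context letter (B/F/I); the naming convention is an assumption here, so check the dataset's README for the real scheme before relying on it.

```python
# Minimal sketch: group the CatMeows WAV files by recording context.
# Assumes the archive is extracted into ./dataset and that each file
# name starts with a context letter (B = brushing, F = waiting for
# food, I = isolation) -- verify against the dataset's README.
from collections import defaultdict
from pathlib import Path

CONTEXTS = {"B": "brushing", "F": "waiting for food", "I": "isolation"}

def group_by_context(root: str = "dataset") -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for wav in Path(root).glob("**/*.wav"):
        label = CONTEXTS.get(wav.name[0].upper(), "unknown")
        groups[label].append(wav)
    return groups

if __name__ == "__main__":
    for label, files in group_by_context().items():
        print(f"{label}: {len(files)} recordings")
```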

Splitgraph Turns Public Data into a PostgreSQL Service

I came across "Port 5432 is open: introducing the Splitgraph Data Delivery Network", where Splitgraph exposes public datasets as a PostgreSQL service:

We launch the Splitgraph Data Delivery Network: a single endpoint that lets any PostgreSQL application, client or BI tool to connect and query over 40,000 public datasets hosted or proxied by Splitgraph.

This looks like it lets a lot of BI tools plug in directly. If they added ODBC or JDBC support on top of that it would be even more broadly usable, but I don't see a pricing strategy yet (it feels like it should sell pretty well?), so it's probably still under development?
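Since the endpoint speaks the plain PostgreSQL wire protocol, any ordinary driver should be able to query it. A minimal sketch with psycopg2 follows; the host/port, database name, and the example dataset name are assumptions based on the announcement, so substitute the values from an actual Splitgraph account.

```python
# Minimal sketch: query the Splitgraph DDN with a plain PostgreSQL driver.
# The endpoint, database name, and credentials below are assumptions --
# replace them with the connection details from your Splitgraph account.
import psycopg2

conn = psycopg2.connect(
    host="data.splitgraph.com",   # assumed endpoint from the announcement
    port=5432,
    user="<api-key>",             # placeholder credentials
    password="<api-secret>",
    dbname="ddn",                 # assumed database name
)

with conn.cursor() as cur:
    # Hypothetical dataset/table name; Splitgraph exposes repositories as
    # schemas that are queried like any other PostgreSQL table.
    cur.execute('SELECT * FROM "some-namespace/some-dataset"."some_table" LIMIT 10')
    for row in cur.fetchall():
        print(row)

conn.close()
```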

I'll toss it to the data team at work to take a look; if the data quality checks out, this feels like it would be a pretty handy service...

Attack Techniques Against Open Data

A while back I came across "Membership Inference Attacks against Machine Learning Models"; the attack it tries to achieve:

[G]iven a data record and black-box access to a model, determine if the record was in the model's training dataset.

In other words, given black-box access to a model trained on some dataset (an open dataset, say), develop a method to determine whether a particular record was part of its training data. And the way to validate the attack is, naturally, to run it and see how well it works:

We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.
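To make the idea concrete, here is a much-simplified sketch of the intuition the attack exploits: a model is usually more confident on records it was trained on. This is only a confidence-threshold baseline on a toy scikit-learn model, not the shadow-model attack from the paper, and the threshold is illustrative.

```python
# Minimal sketch of the core idea behind membership inference: models
# tend to be more confident on records they were trained on.  This is a
# simple confidence-threshold baseline, not the paper's shadow-model
# attack; the threshold and model choice are illustrative only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# An intentionally overfit target model (the "black box" we only query).
target = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_in, y_in)

def guess_membership(records: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Guess 'member' when the prediction vector's max probability is high."""
    confidence = target.predict_proba(records).max(axis=1)
    return confidence >= threshold

members = guess_membership(X_in)       # records that really were in training
non_members = guess_membership(X_out)  # records that were not

print("flagged as members (true members):    ", members.mean())
print("flagged as members (true non-members):", non_members.mean())
```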

The attack uses NNs to attack NNs, and the current mitigations aren't easy to deal with either, though applying them does at least drag the attack's accuracy down. The paper mentions four ways to make the attack harder (a sketch of the first two follows the list):

  • Restrict the prediction vector to top k classes.
  • Coarsen precision of the prediction vector.
  • Increase entropy of the prediction vector.
  • Use regularization.
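For the first two items, here is a minimal sketch of what a served prediction vector could look like after hardening; the k and the rounding precision are just illustrative values.

```python
# Minimal sketch of the first two mitigations applied to a prediction
# vector before it is returned to the caller: keep only the top-k
# classes and round the probabilities to a coarse precision.
import numpy as np

def harden(prediction: np.ndarray, k: int = 3, decimals: int = 1) -> np.ndarray:
    """Return a prediction vector that leaks less about the training data."""
    hardened = np.zeros_like(prediction)
    top_k = np.argsort(prediction)[-k:]                       # k largest scores
    hardened[top_k] = np.round(prediction[top_k], decimals)   # coarsen precision
    return hardened

full = np.array([0.612, 0.201, 0.093, 0.052, 0.027, 0.015])
print(harden(full))   # -> [0.6 0.2 0.1 0.  0.  0. ]
```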

Another piece worth reading is the 2006 "AOL search data leak": after the data was released, real users were identified from it, which caused quite a stir at the time...

Google Curates and Releases Nine Million Images with Corresponding Tags

Google has released nine million CC-licensed images, tagged them, and packaged them as the Open Images dataset: "Introducing the Open Images Dataset". For example:

Annotated images from the Open Images dataset. Left: Ghost Arches by Kevin Krejci. Right: Some Silverware by J B. Both images used under CC BY 2.0 license

However, these aren't classifications made by humans; they are the output of machine learning:

The image-level annotations have been populated automatically with a vision model similar to Google Cloud Vision API.

But since the data hasn't been confirmed by humans, if you want to use it for more precise research you still have to run it through a service like Amazon Mechanical Turk first to verify correctness.
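A minimal sketch of that kind of cleanup step: keep only the high-confidence machine annotations and sample a batch for human review. The file name and the column names (ImageID, LabelName, Confidence) are assumptions, so check the actual Open Images release for the real schema.

```python
# Minimal sketch: filter the machine-generated annotations down to the
# high-confidence ones and sample a batch to send out for human
# verification (e.g. on Mechanical Turk).  The file name and CSV column
# names are assumptions -- check the actual release for the real schema.
import pandas as pd

annotations = pd.read_csv("machine_image_annotations.csv")  # placeholder file name

confident = annotations[annotations["Confidence"] >= 0.8]
to_verify = confident.sample(n=1000, random_state=0)

to_verify[["ImageID", "LabelName"]].to_csv("mturk_verification_batch.csv", index=False)
print(f"kept {len(confident)} of {len(annotations)} annotations, sending 1000 for review")
```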

Public Datasets on Google BigQuery

Like AWS's "AWS Public Data Sets", Google Cloud Platform now offers a similar service for Google BigQuery users: "Google BigQuery Public Datasets".

There isn't much data yet (it was only recently set up); it covers these six items:

  • USA Names Data
  • NYC TLC Trips
  • Hacker News
  • USA Disease Data
  • GDELT Books Corpus
  • NOAA GSOD Weather

And the "Other Public Datasets" section just happens not to mention AWS's... XD
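Querying these from the BigQuery Python client is straightforward. A minimal sketch, assuming a GCP project with BigQuery enabled and application-default credentials; the exact table name (a Hacker News "stories" table under the bigquery-public-data project) is an assumption, so browse the project to see what's actually there.

```python
# Minimal sketch: query one of the public datasets with the BigQuery
# Python client.  Requires a GCP project with BigQuery enabled and
# application-default credentials; the table referenced below is an
# assumption -- browse the bigquery-public-data project for real tables.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT title, score
    FROM `bigquery-public-data.hacker_news.stories`
    WHERE score IS NOT NULL
    ORDER BY score DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.score, row.title)
```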