Tag Archives: dataset

對 Open Data 的攻擊手段

前陣子看到的「Membership Inference Attacks against Machine Learning Models」,裡面試著做到的攻擊手法:

[G]iven a data record and black-box access to a model, determine if the record was in the model's training dataset.

也就是拿到一組 Open Data 的存取權限,然後發展一套方法判斷某筆資料是否在裡面。而驗證攻擊的手法當然就是直接攻擊看效果:

We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.

透過 NN 攻擊 NN,而目前的解法也不太好處理,但有做總是會讓精確度降低。論文裡提到了四種讓難度增加的方法:

  • Restrict the prediction vector to top k classes.
  • Coarsen precision of the prediction vector.
  • Increase entropy of the prediction vector.
  • Use regularization.

另外一個值得看的資料是 2006 年發生的「AOL search data leak」,當年資料被放出來後有真實的使用者被找出來,也是很轟動啊...

Google 整理並公開出九百萬張圖片以及對應的 tag

Google 放出了九百萬張以 CC 授權釋出的圖片,標上 tag 後變成 Open Images dataset:「Introducing the Open Images Dataset」,像是這樣:

Annotated images form the Open Images dataset. Left: Ghost Arches by Kevin Krejci. Right: Some Silverware by J B. Both images used under CC BY 2.0 license


The image-level annotations have been populated automatically with a vision model similar to Google Cloud Vision API.

不過因為這不是人工確認過的資料,如果要拿來做比較精確的研究,還是得用 Amazon Mechanical Turk 這類服務先校正過以確保正確性。

Google BigQuery 提供的 Public Datasets

AWS 的「AWS Public Data Sets」一樣,Google Cloud Platform 也提供了類似的服務給使用 Google BigQuery 的人使用:「Google BigQuery Public Datasets」。

目前資料看起來比較少 (因為最近才建立),包括了這六個項目:

  • USA Names Data
  • NYC TLC Trips
  • Hacker News
  • USA Disease Data
  • GDELT Books Corpus
  • NOAA GSOD Weather

在「Other Public Datasets」的地方就是不寫 AWS 的... XD