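GitHub does provide a license API, but its accuracy is limited (change the license text even slightly and GitHub can no longer detect the correct license), so someone decided to analyze licenses separately with machine learning: "Detecting licenses in code with Go and ML". Naturally, the analysis only covers public repositories.

The largest group is the MIT License, followed by Apache-2.0 (ignoring the "?" group for now), and then the various versions of the GPL family. No big surprises...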
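To illustrate why exact matching breaks on a slightly edited license while fuzzy matching does not, here is a minimal sketch. It is not the approach used in the linked article: it simply compares word trigrams with Jaccard similarity. The reference snippets are heavily abbreviated, and the function names and the 0.5 threshold are assumptions made up for this example.

```python
import re

# Abbreviated opening lines only (an assumption for this sketch); a real
# detector would load the full license texts, e.g. from the SPDX list.
REFERENCE_SNIPPETS = {
    "MIT": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software and associated documentation files",
    "Apache-2.0": 'Licensed under the Apache License, Version 2.0 (the "License"); '
                  "you may not use this file except in compliance with the License",
    "GPL-3.0": "This program is free software: you can redistribute it and/or "
               "modify it under the terms of the GNU General Public License",
}


def shingles(text, size=3):
    """Lowercase the text and return its set of overlapping word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_license(text, threshold=0.5):
    """Return (best matching license name or None, similarity score)."""
    candidate = shingles(text)
    best_name, best_score = None, 0.0
    for name, reference in REFERENCE_SNIPPETS.items():
        score = jaccard(candidate, shingles(reference))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# A lightly edited MIT header: an exact string comparison would fail here.
modified_mit = ("Permission is hereby granted, free of charge, to any person or "
                "organization obtaining a copy of this software and associated "
                "documentation files")
print(detect_license(modified_mit))
```

A set-based similarity like this tolerates small insertions or deletions, at the cost of needing a threshold that has to be tuned on real data.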
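I came across the GitHub project "emilwallner/Screenshot-to-code-in-Keras", which converts an image directly into HTML. The accompanying write-up is "Turning Design Mockups Into Code With Deep Learning".
It is a bit like two things I covered earlier: exporting Sketch designs as iOS/Android code, and using a trained NN (neural network) to turn an image directly into code (the latter happens to be mentioned in the write-up as well).
It feels more and more like NNs are gradually taking over human jobs...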
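If you need to run deep learning workloads on Windows, you can now launch a ready-made image and get going: "Announcing New AWS Deep Learning AMI for Microsoft Windows".
It currently supports Windows Server 2012 R2 and 2016: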
Amazon Web Services now offers an AWS Deep Learning AMI for Microsoft Windows Server 2012 R2 and 2016.
The drivers and the commonly used tools are bundled in as well:
The AMIs also include popular deep learning frameworks such as Apache MXNet, Caffe and Tensorflow, as well as packages that enable easy integration with AWS, including launch configuration tools and many popular AWS libraries and tools. The AMIs come prepackaged with Nvidia CUDA 9, cuDNN 7, and Nvidia 385.54 drivers, and contain the Anaconda platform (supports Python versions 2.7 and 3.5).
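Bruce Schneier brought up several recent, related stories about the privacy problems that arise when machine learning is used in the adult industry: "Technology to Out Sex Workers".
The first is PornHub using machine learning to identify performers along with various "other information". The report cited here is TechCrunch's "PornHub uses computer vision to ID actors, acts in its videos":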
PornHub is using machine learning algorithms to identify actors in different videos, so as to better index them.
The computer vision system can identify specific actors in scenes and even identifies various positions and… attributes.
The second is the problem of stage names being linked to real identities:
People are worried that it can really identify them, by linking their stage names to their real names.
The last is that Facebook is already capable of doing this, and it has already happened:
Facebook somehow managed to link a sex worker's clients under her fake name to her real profile.
Her sex-work identity is not on the social network at all; for it, she uses a different email address, a different phone number, and a different name. Yet earlier this year, looking at Facebook’s “People You May Know” recommendations, Leila (a name I’m using in place of either of the names she uses) was shocked to see some of her regular sex-work clients.
This issue is somewhat similar to mass surveillance...
Blizzard announced that it will hold a StarCraft II AI Workshop in early November: "Announcing the StarCraft II AI Workshop".
On November 3 and 4, Blizzard and DeepMind will co-host the StarCraft II AI Workshop at the Hilton Anaheim hotel, next to the Anaheim Convention Center.
Blizzard staff (along with the DeepMind team) will also be on hand to discuss SC2LE (StarCraft II Learning Environment) and SC2API (StarCraft II API):
Engineers and researchers from Blizzard and DeepMind will also be on-hand to meet with attendees and answers questions about the SC2LE and SC2API.
The dates overlap with BlizzCon 2017 (for now it looks like they fall on its last two days), and the tickets are not interchangeable:
While this event takes place during BlizzCon 2017, it is considered a separate event and is not part of the official BlizzCon program – therefore BlizzCon badges will not grant access to the AI workshop. However, we will be providing a limited pool of shareable BlizzCon badges that attendees of the AI workshop can use to check out BlizzCon and catch the StarCraft II Global Finals for inspiration on how to build superior AIs!
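There should be plenty of news coming out of this... Has the DeepMind team progressed far enough to compete with top players?
A while ago I read "Membership Inference Attacks against Machine Learning Models", which attempts the following attack: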
[G]iven a data record and black-box access to a model, determine if the record was in the model's training dataset.
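In other words, given only black-box query access to a trained model, the goal is to develop a method that determines whether a particular record was part of its training data. And the way to validate the attack is, of course, to run it against real services and see how well it works: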
We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.
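It attacks an NN with an NN. The current mitigations are not easy to apply, but applying them does at least lower the attack's precision. The paper discusses four ways of making the attack harder.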
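To make the black-box setting concrete, here is a toy sketch. It does not reproduce the paper's NN-based attack; it swaps in a much simpler confidence-threshold baseline: guess "member" whenever the model is unusually confident about a record. The toy model, the function names, and the 0.9 threshold are all assumptions made up for illustration.

```python
import math


def make_overfit_model(training_records):
    """Build a toy black-box model that has memorized its training data.

    The returned function gives a confidence in (0, 1] that decays with the
    distance to the nearest training record, standing in for an overfitted
    classifier that is more confident on data it has seen.
    """
    def predict_confidence(record):
        nearest = min(math.dist(record, seen) for seen in training_records)
        return math.exp(-nearest)
    return predict_confidence


def infer_membership(query_model, record, threshold=0.9):
    """Guess that `record` was a training member if confidence is high.

    Only the model's output is used, never its internals or training data.
    """
    return query_model(record) >= threshold


training = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
model = make_overfit_model(training)

print(infer_membership(model, (1.0, 2.0)))  # a record that was in the training set
print(infer_membership(model, (2.0, 2.0)))  # a record the model never saw
```

The attack only works here because the toy model behaves differently on records it was trained on, which is the kind of leakage the paper measures on real services.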
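Another thing worth reading about is the 2006 "AOL search data leak": after the data was released, real users were identified from it, which caused quite a stir at the time...
News posted on the AWS AI Blog at the start of the month: "Tuning Your DBMS Automatically with Machine Learning".
This is research from the Carnegie Mellon Database Group. Besides the default settings, it compares four other configurations: OtterTune (the subject of this research), a tuning script (an open source tool commonly used by people unfamiliar with databases), manual tuning by a DBA, and RDS.
The more obvious conclusions are:
As for the talk about DBAs losing their jobs, I am all for it... tedious work like this should be handed over to automation wherever possible XD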
Google announced GPU support for machines and services on GCP: "GPUs are now available for Google Compute Engine and Cloud Machine Learning".
It is in beta for now and uses the NVIDIA Tesla K80:
Google Cloud Platform gets a performance boost today with the much anticipated public beta of NVIDIA Tesla K80 GPUs.
Taiwan's asia-east1 is included as well. In this first wave everything has to be done through the CLI; support in the web console comes later:
You can now spin up NVIDIA GPU-based VMs in three GCP regions: us-east1, asia-east1 and europe-west1, using the gcloud command-line tool. Support for creating GPU VMs using the Cloud Console appears next week.
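Since the first wave is CLI-only, here is a rough sketch of what such a gcloud invocation might look like, assembled in Python so it can be inspected before running. The instance name, zone, machine type, and image are placeholder assumptions, and the flags are written from memory of the beta-era syntax, so check them against the current gcloud documentation before using them.

```python
import shlex


def build_gpu_instance_command(name, zone, gpu_count=1):
    """Assemble (but do not run) a gcloud command for a K80-backed VM.

    All concrete values here are illustrative assumptions, not taken from
    the announcement.
    """
    return [
        "gcloud", "beta", "compute", "instances", "create", name,
        "--zone", zone,
        "--machine-type", "n1-standard-2",
        # Attach the GPU(s); K80 is the model named in the announcement.
        "--accelerator", f"type=nvidia-tesla-k80,count={gpu_count}",
        # GPU instances cannot live-migrate, so they terminate on maintenance.
        "--maintenance-policy", "TERMINATE",
        "--image-family", "ubuntu-1604-lts",
        "--image-project", "ubuntu-os-cloud",
    ]


command = build_gpu_instance_command("gpu-demo-1", "asia-east1-a")
print(shlex.join(command))
```

Printing the command instead of executing it keeps the sketch safe to run without a GCP project or credentials.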
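So GCP is starting to support GPUs too...
The two services AWS launched this time happen to form a pair: "Amazon Polly – Text to Speech in 47 Voices and 24 Languages" and "Amazon Lex – Build Conversational Voice & Text Interfaces".
Amazon Polly reads text aloud as speech, while Amazon Lex turns speech back into text, although neither supports Chinese yet... Still, they make the user interface side more approachable, and as infrastructure-level services they let startups focus on the product itself.
Google released nine million CC-licensed images, tagged them, and packaged them as the Open Images dataset: "Introducing the Open Images Dataset".
These labels were not produced by humans, though; they are the output of machine learning: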
The image-level annotations have been populated automatically with a vision model similar to Google Cloud Vision API.
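Because the data has not been verified by humans, anyone who wants to use it for more precise research still needs to correct it first with a service such as Amazon Mechanical Turk to ensure accuracy.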