Amazon Aurora 可以直接使用 AWS 的 Machine Learning 服務

AWS 宣佈了 Amazon Aurora 可以直接使用 AWS 自家的 Machine Learning 服務:「New for Amazon Aurora – Use Machine Learning Directly From Your Databases」。

整合了兩個服務,分別是 Amazon SageMaker (各類的模型) 以及 Amazon Comprehend (文字處理相關)。

目前只有 Amazon Aurora MySQL 5.7 的版本有支援,其他的還在做:

The new machine learning integration is available today for Aurora MySQL 5.7, with the SageMaker integration generally available and the Comprehend integration in preview. You can learn more in the documentation. We are working on other engines and versions: Aurora MySQL 5.6 and Aurora PostgreSQL 10 and 11 are coming soon.

這個整合讓程式用起來更方便了...

Amazon Redshift 會自動在背景重新排序資料以增加效能

Amazon Redshift 的新功能,會自動在背景重新排序資料以增加效能:「Amazon Redshift introduces Automatic Table Sort, an automated alternative to Vacuum Sort」。

版本要到更新到 1.0.11118,然後預設就會打開:

This feature is available in Redshift 1.0.11118 and later.

Automatic table sort is now enabled by default on Redshift tables where a sort key is specified.

重新排序的運算會在背景處理,另外帶一些行為學習分析:

Redshift runs the sorting in the background and re-organizes the data in tables to maintain sort order and provide optimal performance. This operation does not interrupt query processing and reduces the compute resources required by operating only on frequently accessed blocks of data. It prioritizes which blocks of table to sort by analyzing query patterns using machine learning.

算是丟著讓他跑就好的東西,升級上去後可以看一下 CloudWatch 的報告,這邊沒有特別講應該是還好... XD

擋 Live 與 Podcast 內廣告的工具

看到「An adblocker for live radio streams and podcasts. Machine learning meets Shazam.」這個專案,這個把 machine learning 用到「正途」上了啊...

不過畢竟是比較複雜的演算法,會吃不少 CPU 資源:

On a regular laptop CPU and with the Python time-frequency analyser, computations run at 5-10X for files and at 10-20% usage for live stream.

不過看用法還是偏向 library 性質,如果要大力推廣可能還是需要有其他人包個更好的界面...

用 Machine Learning 改善 Streaming 品質的服務與論文

Hacker News 上看到「Puffer」這個服務,裡面利用了 machine learning algorithm 動態調整 bitrate,以提昇傳輸品質。

測試得到的數據後來被整理起來一起放進論文:「Continual learning improves Internet video streaming」。

在開頭介紹了 Fugu 這個演算法:

We describe Fugu, a continual learning algorithm for bitrate selection in streaming video.

而 Puffer 就是實驗網站:

We evaluate Fugu with Puffer, a public website we built that streams live TV using Fugu and existing algorithms. Over a nine-day period in January 2019, Puffer streamed 8,131 hours of video to 3,719 unique users.

這個站台提供了許多真實的頻道進行測試:

Stream live TV in your browser. There's no charge. You can watch U.S. TV stations affiliated with the NBC, CBS, ABC, PBS, FOX, and Univision networks.

可以看到 Fugu 的結果很好,比起其他提出的方案是全面性的改善:

這邊是用 WebSocket 測試,並且配合了不同的 TCP congestion algorithm,沒有太考慮額外的計算成本...

AI 版的星海爭霸二將直接透過歐洲區的 Battle.net 匿名與人類對戰

前幾天 Blizzard 公佈的消息,DeepMind 的星海爭霸二 AI (AlphaStar) 將會透過 Blizzard 的 Battle.net 歐洲區伺服器跟人類對戰:「DeepMind Research on Ladder」。

Experimental versions of DeepMind’s StarCraft II agent, AlphaStar, will soon play a small number of games on the competitive ladder in Europe as part of ongoing research into AI.

預設是不會對到的,需要選擇參與:

If you would like the chance to help DeepMind with its research by matching against AlphaStar, you can opt in by clicking the “opt-in” button on the in-game popup window. You can alter your opt-in selection at any time by using the “DeepMind opt-in” button on the 1v1 Versus menu.

但你仍然不會知道對手是人還是 AI,而且如同一般對戰情況,這會影響到你的戰績:

For scientific test purposes, DeepMind will be benchmarking AlphaStar’s performance by playing anonymously during a series of blind trial matches. This means the StarCraft community will not know which matches AlphaStar is playing, to help ensure that all games are played under the same conditions. AlphaStar plays with built-in restrictions that the DeepMind team has defined in consultation with pro players. A win or a loss against AlphaStar will affect your MMR as normal.

okay,這樣大概知道為什麼只開放歐洲區了...

加州從今年七月開始,禁止 AI 偽裝成人類 (前幾天也有一些新聞在報導):「A California law now means chatbots have to disclose they’re not human」,對應的法條在「Bill Text - SB-1001 Bots: disclosure」這邊可以看到:

17941. (a) It shall be unlawful for any person to use a bot to communicate or interact with another person in California online, with the intent to mislead the other person about its artificial identity for the purpose of knowingly deceiving the person about the content of the communication in order to incentivize a purchase or sale of goods or services in a commercial transaction or to influence a vote in an election. A person using a bot shall not be liable under this section if the person discloses that it is a bot.

(b) The disclosure required by this section shall be clear, conspicuous, and reasonably designed to inform persons with whom the bot communicates or interacts that it is a bot.

而加州是 Blizzard Entertainment 的總部...

法條上面對「online platform」有設計排除條款,不過如果只算星海二的人數,有可能不到這個豁免限制... 所以得避開而改用歐洲區來測試?

(c) “Online platform” means any public-facing Internet Web site, Web application, or digital application, including a social network or publication, that has 10,000,000 or more unique monthly United States visitors or users for a majority of months during the preceding 12 months.

(c) This chapter does not impose a duty on service providers of online platforms, including, but not limited to, Web hosting and Internet service providers.

美國軍方應該是超級關注這個議題,相較於 AlphaGo 或是 AlphaZero 是資訊完全透明的遊戲,這次要踏入非對稱資訊的遊戲。

如果在這個領域上有成果的話,可以預期未來的戰爭 (yeah 實體戰爭) 會開始大量採用 AI 了...

假新聞產生器與偵測器

Hacker News 上看到的消息,是關於「使用類神經網路產生新聞」(也就是透過程式大量產生假新聞),這次的結果包括了「產生」與「偵測」兩個面向:「Grover – A State-of-the-Art Defense Against Neural Fake News (allenai.org)」。

實驗的網站在「Grover - A State-of-the-Art Defense against Neural Fake News」這邊,另外也有論文「Defending Against Neural Fake News」可以讀。

幾個月前,OpenAI 利用類神經網路,研發出「自動寫新聞」的程式,當時他們宣稱因為效果太好,決定不完整公開成果:「Better Language Models and Their Implications」,中文的報導可以參考 iThome 這篇:「AI文字產生技術引發假新聞爭議,OpenAI決定只公開部份技術成果」。

而現在 The Allen Institute for Artificial Intelligence 則是成功重製了 OpenAI 的成果,取名叫 Grover,發現訓練出來的模型除了可以拿來寫新聞外,也可以拿來偵測文章是不是機器產生的,而且就他們自己測試,辨識成功率還蠻高的:

To study and detect neural fake news, we built a model named Grover. Our study presents a surprising result: the best way to detect neural fake news is to use a model that is also a generator. The generator is most familiar with its own habits, quirks, and traits, as well as those from similar AI models, especially those trained on similar data, i.e. publicly available news. Our model, Grover, is a generator that can easily spot its own generated fake news articles, as well as those generated by other AIs. In a challenging setting with limited access to neural fake news articles, Grover obtains over 92% accuracy at telling apart human-written from machine-written news. Please read our publication for more information.

不過看起來 source code 與 model 還是沒放出來,但看起來遲早會有對應的 open source clone...

我想到在攻殼電視動畫裡面的情報管制戰,雖然電視動畫裡沒有講得很詳細,但感覺這類工具就是其中一環...

用 Google Docs 惡搞的方式...

看到「UDS : Unlimited Drive Storage」這個專案,利用 Google Docs 存放資料。主要的原因是因為 Google Docs 不計入 Google Drive 所使用的空間:

Google Docs take up 0 bytes of quota in your Google Drive

用這個方法可以存放不少大檔案 (像是各種 ISO image),讓人想起當年 Love Machine 的玩法 (不知道的人可以參考「愛的機器 Love machine」這篇),切割檔案後傳到某些空間以提供下載?只是這邊是用 base64 放到 Google Docs 上...

base64 的資料會比原始資料大 33%,而 Google Docs 單篇的上限大約是 710KB:

Size of the encoded file is always larger than the original. Base64 encodes binary data to a ratio of about 4:3.

A single google doc can store about a million characters. This is around 710KB of base64 encoded data.

方法不是太新鮮,但是讓人頗懷念的... XD

Word2Vec:透過向量猜測其他詞彙的意思

2013 年時在「Automatic Translation Without Dictionaries」這邊看到關於機器翻譯時的自我學習方式,裡面提到了「How Google Converted Language Translation Into a Problem of Vector Space Mathematics」這篇報導,而裡面提到的論文則是 Google 發表在 arXiv 上的「Exploiting Similarities among Languages for Machine Translation」這篇。

最近看到「The Illustrated Word2vec」這篇,把五年多前的記錄交叉拉出來看... 這個算式算是給了大家基本的想法,透過公式來解釋文字的意義:

拉出這樣的關係後,就有機會學習新的詞彙... 進而用在其他語言的翻譯上。

用 NN 演算法重製 Full HD 版的 Star Trek: DS9

看到「Remastering Star Trek: Deep Space Nine With Machine Learning」這篇,裡面用了類神經網路演算法,將本來只有 480p (SD) 的 Star Trek: DS9 升到 1080p (Full HD) 的版本,而且看起來效果還不錯...

意外的看到有人拿 Star Trek 的材料來玩... 依照作者的說明,DS9 一直沒有 Full HD 版的其中一個原因反而是因為「數位化」了。使用類比膠卷的母帶可以透過更高規格的重新掃描而得到高畫質版本,但 DS9 的母帶似乎已經是數位版了,所以反而造成無法透過重新掃描的方式取得 Full HD 版本:

While you can rescan analog film at a higher resolution, video is digital and can't be rescanned. This makes it much costlier to remaster this TV show, which is one of the reasons why it hasn't happened.

現有的 upscale 技術主要都還是以圖片為主,所以作者本來以為對於動態畫面的處理會遇到問題,但蠻意外的超出預期,從影片可以看出來:

看起來之後的 remaster 版本有可能可以靠這個方法先做初步,然後再讓人進去修?

LinkedIn 用機器學習提供雇主可能的職缺對象

先前看到「Learning Hiring Preferences: The AI Behind LinkedIn Jobs」這篇,LinkedIn 用機器學習提供雇主可能的對象。

依照官方的說法,這次提到的改進是透過雇主的行為調整推薦。當雇主對某個人有興趣的時候,LinkedIn 就會調整演算法去配合雇主有興趣的條件:

Based on how you interact with candidates, our algorithm learns your preferences and delivers increasingly relevant candidates across the Jobs product. If you’re consistently interested in candidates who are, say, accountants with leadership skills, or project managers who are adept at social media, we’ll send you more of those. And this all happens online in real time so that your feedback is taken instantly into account.

透過模擬 20% 的加成:

This new algorithm, which is used throughout the Jobs platform, performs nearly 20% better than the previous version in generating recommendations when we simulate our members' past hiring activity.

在 social network 這種演算法其實就是同溫層 (Echo chamber、Filter bubble),在 LinkedIn 這樣的行為不知道會不會牽扯到 Discrimination 的議題...