比 Bloom filter 與 Cuckoo filter 再更進一步的 Xor filter

Bloom filter 算是教科書上的經典演算法之一,在實際應用上有更好的選擇,像是先前提到的 Cuckoo filter:「Cuckoo Filter:比 Bloom Filter 多了 Delete」。

現在又有人提出新的資料結構,號稱又比 Bloom filter 與 Cuckoo filter 好:「Xor Filters: Faster and Smaller Than Bloom Filters」。

不過並不是完全超越,其中馬上可以看到的差異就是不支援 delete:

Deletions are generally unsafe with these filters even in principle because they track hash values and cannot deal with collisions without access to the object data: if you have two objects mapping to the same hash value, and you have a filter on hash values, it is going to be difficult to delete one without the other.

論文的預印本可以在 arXiv 上下載:「Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters」。

引用自己論文的問題...

Nature 上點出來期刊論文裡自我引用的問題 (這邊的自我引用包括了合作過的人):「Hundreds of extreme self-citing scientists revealed in new database」。

開頭舉了一個極端的例子,Vaidyanathan 的自我引用比率高達 94%,而學界的中位數是 12.7%,感覺是有某種制度造成的行為?

Vaidyanathan, a computer scientist at the Vel Tech R&D Institute of Technology, a privately run institute, is an extreme example: he has received 94% of his citations from himself or his co-authors up to 2017, according to a study in PLoS Biology this month. He is not alone. The data set, which lists around 100,000 researchers, shows that at least 250 scientists have amassed more than 50% of their citations from themselves or their co-authors, while the median self-citation rate is 12.7%.

會想要提是因為想到當年 Google 的經典演算法 PageRank,就是在處理這個問題... 把 paper 換成 webpage 而已。

用 Machine Learning 改善 Streaming 品質的服務與論文

Hacker News 上看到「Puffer」這個服務,裡面利用了 machine learning algorithm 動態調整 bitrate,以提昇傳輸品質。

測試得到的數據後來被整理起來一起放進論文:「Continual learning improves Internet video streaming」。

在開頭介紹了 Fugu 這個演算法:

We describe Fugu, a continual learning algorithm for bitrate selection in streaming video.

而 Puffer 就是實驗網站:

We evaluate Fugu with Puffer, a public website we built that streams live TV using Fugu and existing algorithms. Over a nine-day period in January 2019, Puffer streamed 8,131 hours of video to 3,719 unique users.

這個站台提供了許多真實的頻道進行測試:

Stream live TV in your browser. There's no charge. You can watch U.S. TV stations affiliated with the NBC, CBS, ABC, PBS, FOX, and Univision networks.

可以看到 Fugu 的結果很好,比起其他提出的方案是全面性的改善:

這邊是用 WebSocket 測試,並且配合了不同的 TCP congestion algorithm,沒有太考慮額外的計算成本...

Elsevier 限制加州大學的存取權限

三月的時候加州大學系統 (UC) 因為 Elsevier 不接受 open access 的條件而公開宣佈不續約 (參考「加州大學宣佈不與 Elsevier 續約」),後來 Elsevier 應該是試著看看有沒有機會繼續合作,所以在這段期間還是一直提供服務給加州大學系統。

前幾天在 Hacker News 上看到「Elsevier cuts off UC’s access to its academic journals (latimes.com)」,總算是確定要動手了:「In act of brinkmanship, a big publisher cuts off UC’s access to its academic journals」。

不過也不是直接拔掉,而是限制存取權,看不到新東西 (以 2019/01/01 為界):

As of Wednesday, Elsevier cut off access by UC faculty, staff and students to articles published since Jan. 1 in 2,500 Elsevier journals, including respected medical publications such as Cell and the Lancet and a host of engineering and scientific journals. Access to most material published in 2018 and earlier remains in force.

UC 提出的商業模式是讓投稿者負擔費用,而存取者不需要負擔,與現有的商業模式剛好相反。UC 提出的模式鼓勵「知識的散佈」,而現有的商業模式則是反過來,希望透過知識的散佈而賺~大~錢~發~大~財~:

UC demanded that the new contract reflect the principle of open access — that work produced on its campuses be available to all outside readers, for free.

That was a direct challenge to the business model of Elsevier and other big academic publishers. Traditionally, the publishers accept papers for publication for free but charge steep subscription fees. UC is determined to operate under an alternative model, in which researchers pay to have their papers published but not for subscriptions.

另外在 Hacker News 上的 comment 裡看到一些專案也正在進行,像是歐洲的「Plan S」也是在推動 open access:

The plan requires scientists and researchers who benefit from state-funded research organisations and institutions to publish their work in open repositories or in journals that are available to all by 2021.

另外「PubPub · Community Publishing」也是 open source 領域裡蠻有趣的計畫,後面看起來也有不少學術單位在支持。

加州大學宣佈不與 Elsevier 續約

加州大學 (這是一個大學系統,包括了十個校區,超過 25 萬的學生與 14 萬的教職員) 認為 Elsevier 沒有達到 open access 應有的標準,決定將不再跟 Elsevier 續約,並且發出新聞稿抨擊 Elsevier:「UC terminates subscriptions with world’s largest scientific publisher in push for open access to publicly funded research」。

As a leader in the global movement toward open access to publicly funded research, the University of California is taking a firm stand by deciding not to renew its subscriptions with Elsevier. Despite months of contract negotiations, Elsevier was unwilling to meet UC’s key goal: securing universal open access to UC research while containing the rapidly escalating costs associated with for-profit journals.

這應該是美國頂尖學院裡面的第一槍?後續會帶動多少單位不續訂...

歐洲研究機構的資助者推動研究論文的開放存取

在「Radical open-access plan could spell end to journal subscriptions」這邊看到歐洲 11 個研究機構資助者成立了「cOAlition S」,推動研究論文的開放存取。

目標是在 2020 年開始,由這些機構所資助的研究都必須投在符合完全開放條件的平台上:

cOAlition S signals the commitment to implement, by 1 January 2020, the necessary measures to fulfil its main principle: “By 2020 scientific publications that result from research funded by public grants provided by participating national and European research councils and funding bodies, must be published in compliant Open Access Journals or on compliant Open Access Platforms.

而現在大約只有 15%:

According to a December 2017 analysis, only around 15% of journals publish work immediately as open access (see ‘Publishing models’) — financed by charging per-article fees to authors or their funders, negotiating general open-publishing contracts with funders, or through other means.

用這種方式降低那些收錢才能下載的平台的影響力...

Elsevier 讓德國的研究機構在還沒有續約的情況下繼續使用

德國的研究機構在 2017 年年底前,也就是與 Elsevier 的合約到期前,還是沒有續約,但 Elsevier 決定還是先繼續提供服務,暫時性的為期一年,繼續談判:

The Dutch publishing giant Elsevier has granted uninterrupted access to its paywalled journals for researchers at around 200 German universities and research institutes that had refused to renew their individual subscriptions at the end of 2017.

The institutions had formed a consortium to negotiate a nationwide licence with the publisher. They sought a collective deal that would give most scientists in Germany full online access to about 2,500 journals at about half the price that individual libraries have paid in the past. But talks broke down and, by the end of 2017, no deal had been agreed. Elsevier now says that it will allow the country’s scientists to access its paywalled journals without a contract until a national agreement is hammered out.

Elsevier 會這樣做主要是要避免讓德國的學術機構發現「沒有 Elsevier 其實也活的很好」。而不少研究人員已經知道這件事情,在大多數的情況下都有 Elsevier 的替代方案,不需要浪費錢簽那麼貴的費用:

Günter Ziegler, a mathematician at the Free University of Berlin and a member of the consortium's negotiating team, says that German researchers have the upper hand in the negotiations. “Most papers are now freely available somewhere on the Internet, or else you might choose to work with preprint versions,” he says. “Clearly our negotiating position is strong. It is not clear that we want or need a paid extension of the old contracts.”

替代方案有幾個方面,像是自由開放下載的 arXiv 愈來愈受到重視,很多研究者都會把投稿的論文在上面放一份 pre-print 版本 (甚至會更新),而且近年來有些知名的證明只放在上面 (像是 Poincaré conjecture)。而且放在人家家裡比放在自己網站來的簡單 (不需要自己維護),這都使得 arXiv 變成學術界新的標準平台。

除了 arXiv 外,其他領域也有自己習慣的平台。像是密碼學這邊的「Cryptology ePrint Archive」也運作很久了。

除了找平台外,放在自家網站上的論文 (通常是學校或是學術機構的個人空間),也因為搜尋引擎的發達,使得大家更容易找到對應檔案可以下載。

而且更直接的攻擊性網站是 Sci-Hub,讓大家從 paywall 下載後丟上去公開讓人搜尋。雖然因為常常被封鎖的原因而常常在換網址,不過透過 Tor Browser (或是自己設定 Tor Proxy) 存取他們的 Hidden Service 就應該沒這個問題。

希望德國可以撐下去,證明其實已經不需要 Elsevier...

基於 RNN 的無損壓縮

Hacker News 上看到「DeepZip: Lossless Compression using Recurrent Networks」這篇論文,利用 RNN 幫助壓縮技術壓的更小,而程式碼在 GitHubkedartatwawadi/NN_compression 上有公開讓大家可以測試。

裡面有個比較特別的是,Lagged Fibonacci PRNG 產生出來的資料居然有很好的壓縮率,這在傳統的壓縮方式應該都是幾乎沒有壓縮率...

整體的壓縮率都還不錯,不過比較的對象只有 gzip,沒有拿比較先進的壓縮軟體進行比較) 像是 xz 之類的),看數字猜測在一般的情況下應該不會贏太多,不過光是 PRNG 那部份,這篇論文等於是給了一個不同的方向讓大家玩...

Facebook 自己找人研究,Social Media 是否對人類有害 XDDD

之前看到「Hard Questions: Is Spending Time on Social Media Bad for Us?」這篇,一直不知道要怎麼吐槽... 然後看到 Twitter 上的這則 tweet XDDD

既視感太重了,找了一下其他行業對應的資料:

真的不知道怎麼吐槽 XDDD

reddit 與 4chan 在新聞網路上的獨特性

在「Study finds fringe communities on Reddit and 4chan have high influence on flow of alternative news to Twitter」這邊看到的:

After analyzing millions of posts containing mainstream and alternative news shared on Twitter, Reddit and 4chan, Jeremy Blackburn, Ph.D., and collaborators found that alt-right communities within 4chan, an image-based discussion forum where users are anonymous, and Reddit, a social news aggregator where users vote up or down on posts, have a surprisingly large influence on Twitter.

依照對 reddit4chan 的描述,這兩個媒體對 Twitter 的影響,會讓我聯想到在台灣 Ptt 對各新聞媒體的影響:Ptt 是很多新聞的起點?

"Based on our findings, these smaller, fringe communities on Reddit and 4chan serve as an incubation chamber for a lot of information," said Blackburn, assistant professor of computer science in the UAB College of Arts and Sciences. "The content and talking points are refined until they finally break free and make it to larger, more mainstream communities."

真的研究應該可以看出 Ptt 的影響力?