arXiv offers an HTML interface (beta)

I saw "ArXiv now offers papers in HTML format (arxiv.org)" on Hacker News. arXiv has launched a beta HTML interface, announced in "Accessibility update: arXiv now offers papers in HTML format".

Not every paper has one. Only papers recently submitted in a TeX-family format get converted:

We are happy to announce that as of Monday, December 18th, arXiv is now generating an HTML formatted version of all papers submitted in TeX/LaTeX (as long as papers were submitted on or after December 1st, 2023 and HTML conversion is successful – more on this below).

So I first looked up the three Poincaré conjecture papers from twenty years ago, and none of them has an HTML version: "The entropy formula for the Ricci flow and its geometric applications", "Ricci flow with surgery on three-manifolds", and "Finite extinction time for the solutions to the Ricci flow on certain three-manifolds".

Someone in the Hacker News comments pointed to a paper that does have an HTML version: "The detectability of single spinless stellar-mass black holes through gravitational lensing of gravitational waves with advanced LIGO". The rendering looks pretty decent.

Also, this site does not currently appear to be on Fastly:

;; ANSWER SECTION:
browse.arxiv.org.       300     IN      A       34.160.61.147

Presumably all the TeX sources will be converted once this matures and reaches GA?

arXiv moves to the Fastly CDN

I saw arXiv announce that it has moved to the Fastly CDN: "Faster arXiv with Fastly".

I looked up the DNS records for arxiv.org, and this is what they look like now:

;; ANSWER SECTION:
arxiv.org.              10      IN      A       151.101.131.42
arxiv.org.              10      IN      A       151.101.3.42
arxiv.org.              10      IN      A       151.101.67.42
arxiv.org.              10      IN      A       151.101.195.42
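The same kind of check can be scripted. The following is only a rough sketch: the prefix `151.101.0.0/16` is simply the block that covers the four addresses above, and treating it as "Fastly" is an assumption on my part, as are the helper names `resolve_ipv4` and `in_prefix`.

```python
import ipaddress
import socket

# Assumption: this /16 covers the four A records shown above; it is used
# here as a stand-in for "served by Fastly", not as an authoritative list.
ASSUMED_FASTLY_PREFIX = ipaddress.ip_network("151.101.0.0/16")


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses for hostname (needs network)."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def in_prefix(address, network=ASSUMED_FASTLY_PREFIX):
    """Check whether a dotted-quad address falls inside the given network."""
    return ipaddress.ip_address(address) in network
```

With network access, something like `[in_prefix(ip) for ip in resolve_ipv4("arxiv.org")]` would show whether the current answers still fall in that block. The result depends on whatever DNS returns at that moment.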

Testing with mtr, traffic from HiNet still appears to be routed to Singapore.

However, static.arxiv.org is on CloudFront:

;; ANSWER SECTION:
static.arxiv.org.       3600    IN      CNAME   daa2ks08y5ls.cloudfront.net.
daa2ks08y5ls.cloudfront.net. 60 IN      A       13.35.35.100
daa2ks08y5ls.cloudfront.net. 60 IN      A       13.35.35.29
daa2ks08y5ls.cloudfront.net. 60 IN      A       13.35.35.88
daa2ks08y5ls.cloudfront.net. 60 IN      A       13.35.35.127

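According to the official announcement the migration seems to be still in progress, though it's unclear whether hosts already on CloudFront (like static.arxiv.org mentioned above) will be moved over: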

That includes our home page, listings, abstracts, and papers — both PDF and HTML (more on that soon).

Ranking CS papers on arXiv with PageRank

I saw this in "Ask HN: AI/ML papers to catch up with current state of AI?". The thread was originally just about which AI/ML papers are worth reading, but at id=38654200 I came across this site, whose data is updated once a day:

https://trendingpapers.com/

This tool can help you find what's new & relevant to read. It's updated every day (based on ArXiv).

You can filter by category (Computer Vision, Machine Learning, NLP, etc), by release date, but most importantly, you can rank by PageRank (proxy of influence/readership), PageRank growth (to see the fastest growing papers in terms of influence), total # of citations, etc...

According to the "Frequently Asked Questions", it runs PageRank over the papers on arXiv, mostly CS ones.

It's rare to see PageRank show up, and applied to paper citations at that...
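For reference, here is what PageRank over a citation graph looks like in its simplest form. This is a minimal power-iteration sketch, not how trendingpapers.com actually computes its scores (the site does not publish that here). The damping factor 0.85, the function name `pagerank`, and the tiny example graph are all my own assumptions.

```python
def pagerank(citations, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank.

    citations maps each paper to the list of papers it cites.
    Returns a dict of paper -> score; the scores sum to 1.
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}

    for _ in range(iterations):
        # Papers that cite nothing ("dangling") spread their rank evenly.
        dangling = sum(rank[p] for p in nodes if not citations.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in nodes}
        for paper, cited in citations.items():
            if cited:
                share = damping * rank[paper] / len(cited)
                for target in cited:
                    new_rank[target] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: A and B both cite C, and D cites A.
example = {"A": ["C"], "B": ["C"], "C": [], "D": ["A"]}
scores = pagerank(example)
```

In this toy graph C ends up with the highest score because two papers cite it, and A comes next because it is cited by D.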

Xor filter: a step beyond Bloom filter and Cuckoo filter

Bloom filter is one of the classic textbook algorithms, but in practice there are better choices, such as the Cuckoo filter I mentioned before: "Cuckoo Filter: Bloom Filter Plus Delete".

Now someone has proposed a new data structure that claims to beat both Bloom filter and Cuckoo filter: "Xor Filters: Faster and Smaller Than Bloom Filters".

It isn't strictly better, though. One difference you notice right away is that it doesn't support delete:

Deletions are generally unsafe with these filters even in principle because they track hash values and cannot deal with collisions without access to the object data: if you have two objects mapping to the same hash value, and you have a filter on hash values, it is going to be difficult to delete one without the other.
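The collision problem in that quote is easy to reproduce with a toy. The sketch below is not an Xor filter (it skips the actual construction entirely). It is just a set of short fingerprints, which is enough to show why deleting by hash value is unsafe. The class `ToyFingerprintFilter`, the 8-bit fingerprint size, and the `item-N` keys are all made up for illustration.

```python
import hashlib


def fingerprint(item, bits=8):
    """Short fingerprint of a string, deliberately tiny so collisions are easy."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)


class ToyFingerprintFilter:
    """Stores only fingerprints, never the original objects."""

    def __init__(self, bits=8):
        self.bits = bits
        self.fingerprints = set()

    def add(self, item):
        self.fingerprints.add(fingerprint(item, self.bits))

    def might_contain(self, item):
        return fingerprint(item, self.bits) in self.fingerprints

    def unsafe_delete(self, item):
        # Removes the fingerprint, which may be shared with other items.
        self.fingerprints.discard(fingerprint(item, self.bits))


def find_collision(bits=8):
    """Return two distinct keys with the same fingerprint (pigeonhole guarantees one)."""
    seen = {}
    i = 0
    while True:
        item = f"item-{i}"
        fp = fingerprint(item, bits)
        if fp in seen:
            return seen[fp], item
        seen[fp] = item
        i += 1


first, second = find_collision()
toy = ToyFingerprintFilter()
toy.add(first)
toy.add(second)
toy.unsafe_delete(first)
# second was never deleted, but the filter now reports it as absent.
lost = not toy.might_contain(second)
```

After the delete, `second` turns into a false negative, which is exactly what a membership filter is never supposed to produce.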

The preprint can be downloaded from arXiv: "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters".

Elsevier lets German research institutions keep access without a renewed contract

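German research institutions had still not renewed by the end of 2017, when their contracts with Elsevier expired. Elsevier nevertheless decided to keep providing service for now, provisionally for one year, while negotiations continue: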

The Dutch publishing giant Elsevier has granted uninterrupted access to its paywalled journals for researchers at around 200 German universities and research institutes that had refused to renew their individual subscriptions at the end of 2017.

The institutions had formed a consortium to negotiate a nationwide licence with the publisher. They sought a collective deal that would give most scientists in Germany full online access to about 2,500 journals at about half the price that individual libraries have paid in the past. But talks broke down and, by the end of 2017, no deal had been agreed. Elsevier now says that it will allow the country’s scientists to access its paywalled journals without a contract until a national agreement is hammered out.

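Elsevier is doing this mainly to keep German academic institutions from discovering that they can get by just fine without Elsevier. Quite a few researchers already know this: in most cases there are alternatives to Elsevier, so there's no need to waste money on such an expensive contract: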

Günter Ziegler, a mathematician at the Free University of Berlin and a member of the consortium's negotiating team, says that German researchers have the upper hand in the negotiations. “Most papers are now freely available somewhere on the Internet, or else you might choose to work with preprint versions,” he says. “Clearly our negotiating position is strong. It is not clear that we want or need a paid extension of the old contracts.”

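There are several kinds of alternatives. The freely downloadable arXiv is getting more and more attention: many researchers put a preprint of their submitted papers there (and even update it), and in recent years some well-known proofs have been posted only there (such as the Poincaré conjecture). Hosting on someone else's platform is also simpler than hosting on your own site (no maintenance needed). All of this has made arXiv the new standard platform in academia.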

Besides arXiv, other fields have their own customary platforms. In cryptography, for example, the "Cryptology ePrint Archive" has been running for a long time.

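Beyond dedicated platforms, papers posted on authors' own sites (usually personal space at a school or research institution) have also become easier to find and download thanks to better search engines.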

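A more directly aggressive site is Sci-Hub, where people download papers from behind the paywall and upload them so anyone can search for them. It keeps changing domains because it is frequently blocked, but accessing its Hidden Service through Tor Browser (or your own Tor proxy) should avoid that problem.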

I hope Germany can hold out and prove that Elsevier is no longer needed...

Reading arXiv papers as web pages

A service I saw on Hacker News Daily, "Arxiv Vanity – Read academic papers from Arxiv as web pages":

Arxiv Vanity renders academic papers from Arxiv as responsive web pages so you don’t have to squint at a PDF.

In practice, though, I found it can only convert papers that provide TeX source. Papers without it don't work...

Software development at Google

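There have been quite a few write-ups on software development inside Google (such as 2015's "Google Is 2 Billion Lines of Code—And It’s All in One Place"), but this seems to be the first time it has been organized in paper form: "Software Engineering at Google".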

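When you have a group of extremely strong engineers, many of the usual software engineering assumptions get overturned, and a lot of the tooling is custom-built in-house (maybe because no mature tool existed at the time, maybe because heavy customization was needed), so you end up seeing all sorts of interesting solutions... XD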

It's fine as reading material, but copying it will most likely end badly XDDD

Sites for digging up gems on arXiv

The Hacker News thread "Ask HN: How do you get notified about newest research papers in your field?" asks how to find new papers, and the top answers have quite a few good things...

One is Arxiv Sanity Preserver and the other is GitXiv. Both dig up gems from arXiv. I'm noting them here since I'll probably use them later to look things up...