Apache License 2.0 的 RedPajama 7B 釋出

LLaMA 出來以後,打造 open source license 的 LLM 變成大家期待的事情,而 RedPajama 算是蠻多人看好的項目。

結果還在算的過程中間,路上殺出來 Falcon LLM,在釋出當下以一個比較寬鬆的 license (但還不是 open source license),到了六月初直接宣布改用 Apache License, Version 2.0,而且同時放出 7B 與 40B 兩個 model,讓 RedPajama 的消息瞬間被壓下去...

現在 RedPajama 放出 7B 了,而且也宣稱在 HELM 上比 Falcon 7B 好:「RedPajama 7B now available, instruct model outperforms all open 7B models on HELM benchmarks」,在 Hacker News 上對應的討論串在「RedPajama 7B (an Apache 2.0-licensed LLaMa) is now available (together.xyz)」這邊。

不過從這幾個月社群討論的感覺,可以看到大家都覺得 7B 太小了,目前大家都希望是 3090/4090 等級可以跑的顯示卡在當標準,差不多會是 LLaMA 13B 或是 30B (4-bit) 的 model。

這幾個月的競爭太激烈,放話完還沒 release 就被幹掉...

WordPress 誕生 20 年

Matt Mullenweg 寫了一篇文章簡單提到 WordPress 誕生 20 年:「WP20 & Audrey Scholars」。

雖然 Matt Mullenweg 在文章裡都沒提到,但 WordPress 的興起其實跟當年 2004 年最大的 blog 軟體 Movable Type 自己出的包有很大的關係:

With the release of version 3.0 in 2004, there were marked changes in Movable Type's licensing, most notably placing greater restrictions on its use without paying a licensing fee. This sparked criticism from some users of the software, with some moving to the then-new open-source blogging tool WordPress. With the release of Movable Type 3.2, the ability to create an unlimited number of weblogs at all licensing levels was restored. In Movable Type 3.3, the product once again became completely free for personal users.

當年 hlb 在社團主機上用 Movable Type 架了服務讓大家寫,結果後來發生了 license 問題,大家就都順勢跑到 WordPress 上了;而等到 Movable Type 再次想放寬 license 的時候已經來不及了,大家都已經搬完了。

翻了一下最舊的文章 (在另外一個 WordPress 上) 是在 2004 年十月的時候寫的,就有提到當時從 Movable Type 換到 WordPress 的考量:「開場:為什麼用 WordPress」。

目前可商用的 LLM

Ask Hacker News Weekly 上看到的討論,有人問了目前可商用的 LLM 有哪些:「Ask HN: Open source LLM for commercial use?」。

有人提到 GoogleFlan 應該是目前最能打的?在 Hugging Face 上可以下載到:

I've seen this question asked repeatedly in many LLaMa threads, currently the best models that are truly open are the released models from the Flan family by Google, which includes Flan-T5[0] and Flan-UL2[1]. According to its paper, Flan-UL2 performs slightly better than Flan-T5-XXL.

然後差不多是 GPT-3 的等級,離 GPT-3.5 或是演伸出來的 ChatGPT 都還有段距離。但如果針對特定情境下 tune 的話應該還是能用的:

These models perform slightly better than GPT-3 under some tasks[2], but they're still far from achieving the results from GPT-3.5 and GPT-4. This becomes evident when you try to use them in the real world; they're not "good enough" for general use cases, unlike ChatGPT models. However, if you can restrict your use case to one particular domain, you can achieve pretty good results by further fine-tuning these models.

另外一則回覆有提到一些其他的 model:

The ones I saw mentioned so far were Flan, Cerebras, GPT-J, and RWKV.

Not yet mentioned:

* Pythia https://github.com/EleutherAI/pythia

* GLM-130B https://github.com/THUDM/GLM-130B - see also ChatGLM-6B https://github.com/THUDM/ChatGLM-6B

* GPT-NeoX-20B https://huggingface.co/EleutherAI/gpt-neox-20b

* GeoV-9B https://github.com/geov-ai/geov

* BLOOM https://huggingface.co/bigscience/bloom and BLOOMZ https://huggingface.co/bigscience/bloomz

看起來如果有需要用的話是可以從這裡面挖看看...

提議 OpenJDK 的程式碼 UTF-8 化

Hacker News 上看到提議把 OpenJDK 的程式碼 UTF-8 化:「Make JDK source code UTF-8 (openjdk.org)」,原始的 jira ticket 在「Make JDK source code UTF-8」這邊。

主要是現在的 source code 沒有統一標準:

Currently, the source code in the JDK is in an ill-defined encoding. There is no official declaration of the encoding used. It is "mostly ASCII", but the relatively few non-ASCII characters used are not well-defined. In many cases, it is latin-1, but I am pretty certain other encodings are used for e.g. Asian translations.

而 mailing list 上看起來還是有想要維持 ASCII 的討論:「Making the source code utf-8」,不過他提出來的理由我覺得不太行,在 console 跑 vi 或是 emacs 我覺得不太是個好理由...

GitHub 自己開發的搜尋引擎

前陣子 GitHub 發了一篇文章,說明自己開發搜尋引擎的心路歷程:「The technology behind GitHub’s new code search」。

看了一下其實就是自己幹了一套 search engine cluster,然後針對 code search 把一些功能放進去。

目前這套 search enginer 還是 beta 版本,全站兩億個 repository 只包括了 4500 萬 (大概 22% 左右),然後已經有 115TB 的程式碼了;另外也題到了先前導入 Elasticsearch 時的數字是 800 萬個 repository:

GitHub’s scale is truly a unique challenge. When we first deployed Elasticsearch, it took months to index all of the code on GitHub (about 8 million repositories at the time). Today, that number is north of 200 million, and that code isn’t static: it’s constantly changing and that’s quite challenging for search engines to handle. For the beta, you can currently search almost 45 million repositories, representing 115 TB of code and 15.5 billion documents.

目前是 32 台機器,沒有特別提到記憶體大小,也沒有提到 replication 之類的數字:

Code search runs on 64 core, 32 machine clusters.

然後各種 inverted index 與各種資料在壓縮後只有 25TB:

There are some big wins on the size of the index as well. Remember that we started with 115 TB of content that we want to search. Content deduplication and delta indexing brings that down to around 28 TB of unique content. And the index itself clocks in at just 25 TB, which includes not only all the indices (including the ngrams), but also a compressed copy of all unique content. This means our total index size including the content is roughly a quarter the size of the original data!

換算一下,就會發現現在已經是「暴力」可以解很多事情的年代了,而這已經是全世界最大的 code hosting。

以前隨便一個主題搞大一點就會撞到 Amdahl's law,現在輕鬆不少...

華盛頓郵報怎麼把 Mapbox 換成其他 open source 方案

Hacker News 上面看到「How The Post is replacing Mapbox with open source solutions」這篇,講華盛頓郵報怎麼把 Mapbox 換成 open source 方案,對應的討論在「Replacing Mapbox with open source solutions (kschaul.com)」這邊。

維基百科有提到大概兩年前,2020 年底的時候 Mapbox GL JS 從開源授權換成私有授權了 (也可以參考先前寫的「Mapbox GL JS 的授權改變,以及 MapLibre GL 的誕生」這篇):

In December 2020, Mapbox released the second version of their JavaScript library for online display of maps, Mapbox GL JS. Previously open source code under a BSD license, the new version switched to proprietary licensing. This resulted in a fork of the open source code, MapLibre GL, and initiation of the MapLibre project.

裡面題到了四個工具。

第一個是 OpenMapTiles,下載部份的圖資使用,對於報導只需要某個區域很方便:

OpenMapTiles is an extensible and open tile schema based on the OpenStreetMap. This project is used to generate vector tiles for online zoomable maps. OpenMapTiles is about creating a beautiful basemaps with general layers containing topographic information.

第二個是 Maputnik,可以修改圖資呈現的方式:

A free and open visual editor for the Mapbox GL styles targeted at developers and map designers.

第三個是 PMTiles,可以將圖資檔案塞到一個大檔案裡面,然後透過 HTTP range requests 下載需要的部份就好,大幅下降 HTTP request 所需要的費用 (很多 CDN 會依照 HTTP request 數量收費):

Protomaps is a serverless system for planet-scale maps.

An alternative to map APIs at 1% the cost, via single static files on your own cloud storage. Deploy datasets like OpenStreetMap for your site in minutes.

最後就是 fork 出來開源版本的 maplibre-gl-js

MapLibre GL JS is an open-source library for publishing maps on your websites or webview based apps. Fast displaying of maps is possible thanks to GPU-accelerated vector tile rendering.

It originated as an open-source fork of mapbox-gl-js, before their switch to a non-OSS license in December 2020. The library's initial versions (1.x) were intended to be a drop-in replacement for the Mapbox’s OSS version (1.x) with additional functionality, but have evolved a lot since then.

這樣看起來好像可以用在像 KKTIX 這種下面顯示固定地圖的地方:

關於 twemoji.maxcdn.com 這個網址的一些事情

在「phpBB 3.3.10 Release」這邊看到這個修正,研究了一下發現原來有些故事在後面跑:

Update the emoji CDN: PHPBB3-17071

Twitter Emoji (Twemoji) 這個計畫是 Twitter 弄出來的 open source project,最主要是讓不支援新版 Unicode 的系統上可以改用圖片顯示出來 (畢竟 Unicode 一直在加字)。

其中提供的 url 是 https://twemoji.maxcdn.com/v/latest/twemoji.min.js,可以看到裡面帶有 MaxCDN 的資訊,算是一種廣告,但後來 StackPath 在 2016 併購 MaxCDN,接下來是在去年打算要淘汰掉 MaxCDN 這個產品線:「MaxCDN and SecureCDN are Retiring; Here’s What It Means for You」。

這件事情被帶到「Clarify MaxCDN URLs now that MaxCDN is retiring #556」這邊討論,但看起來沒有太多動作,後來在「Maxcdn has shut down, cdn not working anymore. #580」這邊又被帶起來討論,其中 Twitter 前員工大概提了一下情況,主要是當年他們跟 MaxCDN 有談過讓 MaxCDN 負責頻寬的部份:

@simplexx among all the things, twemoji has always been a Twitter service for the community.

At my times in there, we had a great agreement / deal with MaxCDN so that it's hard to blame the boss this time, as MaxCDN is a completely different company/story.

What I see is some poor attention to this project, as companies don't close from a day to another (usually?) but as we all know what's going on @ twitter, I can't really blame any of my former colleagues, or new arrivals there.

Please let's not make it a wall of shame for all the people that worked on this, thanks for your understanding (I've left 7 years ago or more, as example, I've got pinged by some follower and I'm just trying to help you out anyway).

另外也有人提到目前 Twitter 的溝通管道的狀況:

For what it's worth, while I was still there we were in talks with MaxCDN to have the same deal when they migrated to be Stackpath. Everyone who had worked on the deal with MaxCDN had left and left no record of that deal, so it was taking them a bit longer to work out than expected – MaxCDN used to offer free hosting to OSS projects, but Stackpath wouldn't be doing that. They were working on an exception for us, but any emails they send to our Twitter emails now get bounced, so I guess they could've gotten this sorted out before shutting down MaxCDN... but it's not like we're there on the other end to make it happen, so it appears we'll never know.

Anyway,現在是至少讓 twemoji.maxcdn.com 恢復到「會動」了,但情況變得很特殊,twemoji.maxcdn.com 這個網址目前是指到競業的 BunnyCDN 上:

;; ANSWER SECTION:
twemoji.maxcdn.com.     229     IN      CNAME   twemoji.b-cdn.net.
twemoji.b-cdn.net.      35      IN      A       23.248.177.58

但如果你真的打過去要檔案,會是 301 到 jsDelivr 上:

$ http https://twemoji.maxcdn.com/2/svg/1f525.svg
[...]
Location: https://cdn.jsdelivr.net/npm/twemoji@11.3.0/2/svg/1f525.svg
[...]

而 jsDelivr 目前是放在 Fastly 上:

;; ANSWER SECTION:
cdn.jsdelivr.net.       180     IN      CNAME   jsdelivr.map.fastly.net.
jsdelivr.map.fastly.net. 30     IN      A       199.232.45.229

但在 GitHub 的說明上面則是建議用 UNPKG

<script src="https://unpkg.com/twemoji@latest/dist/twemoji.min.js" crossorigin="anonymous"></script>

而 UNPKG 目前的 CDN 的部份則是 Cloudflare

;; ANSWER SECTION:
unpkg.com.              86400   IN      NS      anirban.ns.cloudflare.com.
unpkg.com.              86400   IN      NS      aron.ns.cloudflare.com.
;; ANSWER SECTION:
unpkg.com.              300     IN      A       104.16.126.175
unpkg.com.              300     IN      A       104.16.123.175
unpkg.com.              300     IN      A       104.16.122.175
unpkg.com.              300     IN      A       104.16.124.175
unpkg.com.              300     IN      A       104.16.125.175

指到 BunnyCDN 但是 BunnyCDN 只負責 redirect,然後也不是導到 UNPKG 上...

這兩個禮拜爆紅的 Stable Diffusion

Stable DiffusionStability AI 訓練出來的 model,跟之前提到的 DALL-E 最大的差異就是產生出的圖的限制少很多:

Unlike competing models like DALL-E, Stable Diffusion is open source and does not artificially limit the images it produces, though the license prohibits certain harmful use cases.

這也造就了這兩個禮拜整個 Stable Diffusion 的各種應用急速成長。

Simon Willison 的「Stable Diffusion is a really big deal」這篇來當作總覽還不錯。

除了授權使用上的限制以外,在技術上的限制也比較少 (有很大一部分會歸功於社群的各種 porting),包括了:

除了先前大家已經熟悉的 txt2img 功能以外,Stable Diffusion 另外提供了 img2img 的能力,也就是先給一張圖,然後再給對應的句子要求 Stable Diffusion 去改這張圖,所以就會有像是把這張圖:

加上「A distant futuristic city full of tall buildings inside a huge transparent glass dome, In the middle of a barren desert full of large dunes, Sun rays, Artstation, Dark sky full of stars with a shiny sun, Massive scale, Fog, Highly detailed, Cinematic, Colorful」的句子後,提供了這張圖:

以及這張圖:

這樣可玩性又多了不少...

從 Android (AOSP) fork 出來的 /e/

上個禮拜在 Hacker News 看到的「Review of /e/ – An Android Alternative For Mobile Phones」,在講 /e/ 這個從 AOSP 改出來的作業系統,主力在於「unGoogled」這件事情,避免任何資料傳回給 Google。Hacker News 上對應的討論在「Review of /e/ – Android-based alternative for mobile phones (thenewleafjournal.com)」這邊。

先看了一下運作方式,/e/ 的後面是 e Foundation,以非營利組織的方式經營。

LineageOS 的經驗來看,看起來有蠻多東西預先包好了,像是預掛了 microG 來模擬 Google Play Services 的服務與 API,這樣可以讓一些需要 Google Play Services 的服務可以跑 (但可以預期不會是完全相容)。

另外也有一些商業合作,所以市場上可以買到出廠就已經安裝 /e/ 的手機,讓一般使用者更容易上手。另外一條可以預期的路是自己刷 /e/,從「Smartphone Selector」這邊可以看到 /e/ 支援很多型號。

文章裡另外題到了其他的 AOSP fork,走不同的路線:

In addition to LineageOS, there are two forks focused primarily on security – GrapheneOS and CalyxOS. There is also Replicant, which appears to mostly support older devices at this time.

看起來弄個 Pixel 5a 或是舊一點的 Pixel 4a 應該是個還可以的方向,Google 自家牌的手機通常都是這些 distribution 優先支援的機種...