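LLMs currently available for commercial use

I saw this discussion in Ask Hacker News Weekly, where someone asked which LLMs can currently be used commercially: "Ask HN: Open source LLM for commercial use?".

One reply suggested that Google's Flan family is probably the strongest option right now, and the models can be downloaded from Hugging Face: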

I've seen this question asked repeatedly in many LLaMa threads, currently the best models that are truly open are the released models from the Flan family by Google, which includes Flan-T5[0] and Flan-UL2[1]. According to its paper, Flan-UL2 performs slightly better than Flan-T5-XXL.

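That puts them roughly at GPT-3 level, still some distance from GPT-3.5 and the ChatGPT models derived from it. If you fine-tune them for a specific scenario, though, they should still be usable: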

These models perform slightly better than GPT-3 under some tasks[2], but they're still far from achieving the results from GPT-3.5 and GPT-4. This becomes evident when you try to use them in the real world; they're not "good enough" for general use cases, unlike ChatGPT models. However, if you can restrict your use case to one particular domain, you can achieve pretty good results by further fine-tuning these models.

Another reply mentioned some other models:

The ones I saw mentioned so far were Flan, Cerebras, GPT-J, and RWKV.

Not yet mentioned:

* Pythia https://github.com/EleutherAI/pythia

* GLM-130B https://github.com/THUDM/GLM-130B - see also ChatGLM-6B https://github.com/THUDM/ChatGLM-6B

* GPT-NeoX-20B https://huggingface.co/EleutherAI/gpt-neox-20b

* GeoV-9B https://github.com/geov-ai/geov

* BLOOM https://huggingface.co/bigscience/bloom and BLOOMZ https://huggingface.co/bigscience/bloomz

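It looks like these are the places to dig through if you actually need one...

So there is a proper name for it: TOCTOU (Time-of-check to time-of-use)

I ran into this term while reading "The trouble with symbolic links": "TOCTOU (Time-of-check to time-of-use)", which translates literally as "check first, then use". It is a common security (hole) pattern: whatever was checked can be changed by someone else after the check, so a vulnerability can open up by the time it is actually used.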
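To make the pattern concrete, here is a minimal Python sketch. The function names are hypothetical and it assumes a POSIX system where `os.O_NOFOLLOW` is available. The first function checks the path and then opens it in two separate steps, which leaves a window for a swap. The second opens first and then inspects the descriptor it actually got. This only narrows the race for a symlink in the final path component; it is not a general fix.

```python
import os
import stat


def read_checked_then_used(path):
    """Racy pattern: the check and the use are two separate system calls."""
    if os.path.islink(path):          # time of check
        raise ValueError("refusing to follow a symlink")
    # Another process can replace `path` with a symlink right here.
    with open(path, "rb") as f:       # time of use
        return f.read()


def read_via_descriptor(path):
    """Open first, then inspect the descriptor that was actually opened."""
    # O_NOFOLLOW makes the open fail if the final path component is a symlink.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        info = os.fstat(fd)           # describes the opened file, not the path
        if not stat.S_ISREG(info.st_mode):
            raise ValueError("not a regular file")
        # closefd=False leaves closing to the finally block below.
        with os.fdopen(fd, "rb", closefd=False) as f:
            return f.read()
    finally:
        os.close(fd)
```

Symlinks in parent directories are not covered by `O_NOFOLLOW`, so this is only one piece of a defence.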

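In environments like databases, isolation (the I in ACID) can ensure this kind of problem doesn't happen (you need REPEATABLE READ or a higher isolation level).

In file systems things don't go as smoothly. Research in 2004 showed that there is no portable way to guarantee that TOCTOU problems are avoided: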

In the context of file system TOCTOU race conditions, the fundamental challenge is ensuring that the file system cannot be changed between two system calls. In 2004, an impossibility result was published, showing that there was no portable, deterministic technique for avoiding TOCTOU race conditions.

One mitigation is to track file descriptors:

Since this impossibility result, libraries for tracking file descriptors and ensuring correctness have been proposed by researchers.

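Another approach (one that addresses the root cause) is for the file system API to support transactions, but that doesn't seem to have been accepted by the mainstream?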

An alternative solution proposed in the research community is for UNIX systems to adopt transactions in the file system or the OS kernel. Transactions provide a concurrency control abstraction for the OS, and can be used to prevent TOCTOU races. While no production UNIX kernel has yet adopted transactions, proof-of-concept research prototypes have been developed for Linux, including the Valor file system and the TxOS kernel. Microsoft Windows has added transactions to its NTFS file system, but Microsoft discourages their use, and has indicated that they may be removed in a future version of Windows.

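The problem at the moment seems to be that there is no API design the Linux community can accept?

The Google vs. Oracle case over the Java API

Plenty of media outlets have probably covered this over the past few days, so this is more of a roundup of the material I've seen.

The full opinion published by the US Supreme Court is available at "18-956_d18f.pdf", which is the most important source.

Many other places have been updated as well, such as the Wikipedia article "Google LLC v. Oracle America, Inc.".

The software industry has been watching this case closely too. It's rare to see a story on Hacker News with more than four thousand upvotes: "Google’s copying of the Java SE API was fair use [pdf] (supremecourt.gov)", though the discussion there feels to me like onlookers with popcorn...

The first important piece of news is of course the 6-2 ruling that this was fair use, with the case sent back to the Federal Circuit (though the Supreme Court has already settled the most important part). Note, however, that the more fundamental question of whether APIs are copyrightable was not decided: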

In April 2021, the Supreme Court ruled in a 6–2 decision that Google's use of the Java APIs fell within the four factors of fair use, bypassing the question on the copyrightability of the APIs. The decision reversed the Federal Circuit ruling and remanded the case for further review.

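The first three-plus pages of the opinion PDF are a summary of the key points, and Page 44 to Page 62 contain the dissent from the two dissenting justices (Clarence Thomas and Samuel Alito), where you can see them criticize both the copyrightability question and the fair-use analysis.

This outcome has a huge impact on the software and internet industries. For example, lots of companies ship products with Amazon S3-compatible APIs (a network-based API in this case). Firefox also adopted Chromium's Manifest format for compatibility, to lower the cost for developers building extensions.

Everyone should be able to do this kind of thing more freely from now on...

Google Cloud Platform is starting to play with the Reserved Instances idea too (Committed use discounts)

I saw that Google Compute Engine on Google Cloud Platform is starting to offer something like RIs as well: "Committed use discounts".

It already had Sustained use discounts, which kick in automatically once your usage reaches a certain level, with no manual intervention needed. For instances running 24x7, though, the discount still doesn't match AWS's RIs.

Committed use discounts work similarly to Amazon EC2, with the same one-year and three-year terms. The pricing model differs because GCE offers custom machine types where users pick their own vCPU and memory, so the discounts are designed around those two choices.

Small machines are not eligible for the discount, though, unlike AWS where RIs can be purchased for every instance type: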

You can only use committed use discounts for predefined machine types and custom machine types. Small machine types, such as f1-micro and g1-small, are not eligible for committed use discounts.

It's currently in beta:

This is a Beta release of Committed Use Discounts. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes.

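The incompleteness of STARTTLS and large-scale monitoring of email

"Don’t count on STARTTLS to automatically encrypt your sensitive e-mails" discusses the problems with STARTTLS, citing the paper "Neither Snow Nor Rain Nor MITM ... An Empirical Analysis of Email Delivery Security".

STARTTLS in SMTP does provide encryption, but as is well known, someone on the network path can interfere with the EHLO response so that no STARTTLS session is established, which makes the sender fall back to traditional unencrypted SMTP. The research found that this kind of interception is in fact already happening at large scale:

You can see that the situation in Tunisia is far worse than one would imagine...

The current idea is to develop a trust-on-first-use design similar to HSTS, and perhaps this report will help speed that along...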
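As a rough illustration of what trust on first use could look like on the sending side, here is a small Python sketch. The class and function names are hypothetical, the memory of past hosts is just an in-memory set, and it is not the design from the paper or from any standard. The idea is only that once a host has been seen offering STARTTLS, a later session without it is treated as a suspected downgrade rather than silently sent in plaintext.

```python
import smtplib


class DowngradeSuspected(Exception):
    """Raised when a host that used to offer STARTTLS no longer does."""


class TofuStartTLSPolicy:
    """Remembers which hosts have offered STARTTLS before (hypothetical)."""

    def __init__(self):
        self._seen_tls = set()

    def check(self, host, offers_starttls):
        """Return True if the session should be upgraded to TLS."""
        host = host.lower()
        if offers_starttls:
            self._seen_tls.add(host)
            return True
        if host in self._seen_tls:
            # The extension vanished from EHLO: possibly stripped in transit.
            raise DowngradeSuspected(host)
        # Never seen with STARTTLS: nothing to compare against yet.
        return False


def probe(host, policy, port=25, timeout=10):
    """Connect, read the EHLO extensions, and apply the policy."""
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo()
        use_tls = policy.check(host, smtp.has_extn("starttls"))
        if use_tls:
            smtp.starttls()
            smtp.ehlo()  # extensions must be re-read after the upgrade
        return use_tls
```

The first contact with a host is still unprotected, which is the usual limitation of trust on first use.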

Google's book scanning service ruled to be "fair use"

Google's book scanning service has been ruled fair use: "Google's Book-Scanning Project Ruled to Be Legal `Fair Use'".

“Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality and display of snippets from those works are non-infringing fair uses,” U.S. Circuit Judge Pierre Leval wrote on behalf of the court. “The purpose of the copying is highly transformative, the public display of text is limited and the revelations do not provide a significant market substitute for the protected aspects of the originals.”

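It looks like this was fought all the way up to the Second Circuit Court of Appeals? (which covers the New York area)

Ninth Circuit: fair use must be considered before sending a DMCA takedown notification

This comes from EFF's "Takedown Senders Must Consider Fair Use, Ninth Circuit Rules". For the case itself, see "Lenz v. Universal Music Corp.", or EFF's own summary "Lenz v. Universal". EFF filed the suit against Universal for infringing fair use rights: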

The Electronic Frontier Foundation (EFF) filed suit against Universal Music Publishing Group (UMPG) asking a federal court to protect the fair use and free speech rights of a mother who posted a short video of her toddler son dancing to a Prince song on the Internet.

It started when Stephanie Lenz uploaded a 29-second video with the song Let's Go Crazy playing in the background, and Universal had it taken down with a DMCA takedown notification:

Stephanie Lenz's 29-second recording shows her son bouncing along to the Prince song "Let's Go Crazy " which is heard playing in the background. Lenz uploaded the home video to YouTube in February to share it with her family and friends.

Stephanie Lenz then sent a counter notification and sued Universal for abusing the DMCA notification process:

In late June 2007, Lenz sent YouTube a counter-notification, claiming fair use and requesting the video be reposted. Six weeks later, YouTube reposted the video. In July 2007, Lenz sued Universal for misrepresentation under the DMCA and sought a declaration from the court that her use of the copyrighted song was non-infringing. According to the DMCA 17 U.S.C. § 512(c)(3)(A)(v), the copyright holder must consider whether use of the material was allowed by the copyright owner or the law.

Universal, meanwhile, made it plain that it didn't care about fair use:

In September 2007, Prince released statements that he intended to "reclaim his art on the internet." In October 2007, Universal released a statement amounting to the fact that Prince and Universal intended to remove all user-generated content involving Prince from the internet as a matter of principle.

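So the two sides have been in court since 2007. In the first ruling, the district court held that a DMCA takedown can only be sent after the copyright owner has confirmed infringement, which includes considering fair use: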

The district court held that copyright owners must consider fair use before issuing DMCA takedown notices. Thus, the district court denied Universal's motion to dismiss Lenz's claims, and declined to dismiss Lenz's misrepresentation claim as a matter of law.

It also found that Universal's concerns about the burden of considering fair use were overstated:

The district court believed that Universal's concerns over the burden of considering fair use were overstated, as mere good faith consideration of fair use, not necessarily an in-depth investigation, is sufficient defense against misrepresentation. The court also explained that liability for misrepresentation is crucial in an important part of the balance in the DMCA.

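Then it went up on appeal, and a few days ago the Ninth Circuit Court of Appeals announced that it upheld the original ruling. This is the official PDF: "UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT (PDF)". The summary section states the conclusion of this ruling: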

The panel held that the DMCA requires copyright holders to consider fair use before sending a takedown notification, and that failure to do so raises a triable issue as to whether the copyright holder formed a subjective good faith belief that the use was not authorized by law.

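This ruling means that programs that automatically and unconditionally send takedown notifications are now covered as well. We'll see what EFF does next...

Protecting the .onion domain name

.onion is used for Tor hidden services, and there are now efforts from several angles to keep this top-level domain from being registered. The ".onion" post I saw on the IETF blog is one of those efforts.

The plan is to protect .onion as a special name in the same way as .local, .localhost, and .example (see RFC 6761, "Special-Use Domain Names"), and a new RFC has been proposed for it (currently a draft): "The .onion Special-Use Domain Name".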
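To show what treating a name as special-use might mean for software, here is a small Python sketch. The function names are hypothetical and the set only contains the names mentioned above, not a complete registry. The assumption is that a resolver library would check the last label and refuse to send such names to the public DNS.

```python
# Names mentioned in this post; not a complete special-use registry.
SPECIAL_USE_TLDS = {"local", "localhost", "example", "onion"}


def special_use_tld(name):
    """Return the special-use TLD of `name`, or None if it has none."""
    labels = name.rstrip(".").lower().split(".")
    tld = labels[-1]
    return tld if tld in SPECIAL_USE_TLDS else None


def should_query_public_dns(name):
    """Hypothetical policy: never leak special-use names to the public DNS."""
    return special_use_tld(name) is None
```

A real implementation would hand .onion names to Tor instead of simply refusing them. That part is left out here.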

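If it passes, there will be a standard to follow. Until then, .onion remains only a de facto standard...

Wikipedia updates its Terms of Use to require disclosure of conflicts of interest

Yesterday's announcement of the Wikipedia Terms of Use amendment raised the issue of "disclosing conflicts of interest": "Making a change to our Terms of Use: Requirements for disclosure". There is a Simplified Chinese version of the explanation at the end of that document, which readers who are less comfortable with English can read first.

The new "Terms of Use" has a dedicated section, "Paid contributions without disclosure":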

These Terms of Use prohibit engaging in deceptive activities, including misrepresentation of affiliation, impersonation, and fraud. As part of these obligations, you must disclose your employer, client, and affiliation with respect to any contribution for which you receive, or expect to receive, compensation. You must make that disclosure in at least one of the following ways:

  • a statement on your user page,
  • a statement on the talk page accompanying any paid contributions, or
  • a statement in the edit summary accompanying any paid contributions.

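The full diff of this amendment can be seen at "Difference between revisions of "Terms of Use" - Wikimedia Foundation".

This is a countermeasure against "paid editing": there are even companies abroad that charge to have people edit Wikipedia (see the October 2013 post "Wikimedia Foundation Executive Director Sue Gardner’s response to paid advocacy editing and sockpuppetry"). Adding this directly to the Terms of Use turns what used to be only a community guideline into something that can be enforced in court.

This should have been done long ago, and it matters a lot...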