npm being used to store movies...

The npm here means the official repository... Saw a post on Lobsters about people storing .ts (MPEG transport stream) files on npm: 「npm flooded with 748 packages that store movies」; the original article is here: 「npm flooded with 748 packages that store movies」.

The user found doing this is wlwz (not sure when the account will disappear, but there are backups on both the Internet Archive and Archive Today).

Looking at the first package listed, wlwz-2312-1405, you can see ten .ts files inside it.

Searching for some of the strings, it looks like people in China have recognized it as the TV series 武林外傳 (whose initials are exactly wlwz): 「npm 被用来保存盗版视频」.

This is the kind of problem every hosting service has to deal with...

How Russians in the 1990s backed up digital data onto VHS tapes: ArVid

Saw 「ArVid: how Russians squeezed 4 hard drives into one VHS tape in the 90s」 on Hacker News: in the 1990s, Russians came up with a way to back up digital data onto VHS tapes, a kit called ArVid.

The approach reuses the VHS deck (VCR) people already had at home: you plug an ISA card into a 386 PC (the counterpart of a PCIe card in today's machines), then connect the card to the VCR's Video In (for writing backups) and Video Out (for reading data back). The ISA card also has an infrared LED module on a cable that you attach to (stick onto) the VCR's IR receiver, which lets the card control the VCR through the "remote control" protocol.

The medium this idea uses is essentially the same as a tape drive's; ArVid just adds an extra step of converting the data into a video signal so it can use an off-the-shelf VCR.

On top of that, ArVid adds a Hamming code, giving it the ability to detect and correct errors when reading the data back later.
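To illustrate the idea, here is a minimal Hamming(7,4) sketch in Python; the article doesn't say which Hamming variant ArVid actually used, so this is just the textbook single-error-correcting code:

# Minimal Hamming(7,4): encode 4 data bits into 7 bits, then detect
# and correct any single-bit error on decode. (Which Hamming variant
# ArVid used isn't specified; this only illustrates the principle.)

def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]  # codeword positions 1..7

def hamming74_decode(c):
    # Recompute the parity checks; the syndrome is the 1-based
    # position of the flipped bit, or 0 if no error was detected.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 * 1 + s2 * 2 + s3 * 4
    if syndrome:
        c[syndrome - 1] ^= 1   # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode([1, 0, 1, 1])
codeword[3] ^= 1               # simulate a single bit flip on "tape"
assert hamming74_decode(codeword) == [1, 0, 1, 1]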

A three-hour VHS tape can hold 2GB of data. To get a feel for how much that was, pull up 「History of hard disk drives」 and look at what hard drives looked like in the early 1990s:

1990 – IBM 0681 "Redwing" – 857 megabytes, twelve 5.25-inch disks. First HDD with PRML Technology (Digital Read Channel with 'partial-response maximum-likelihood' algorithm).

1991 – Areal Technology MD-2060 – 60 megabytes, one 2.5-inch disk platter. First commercial hard drive with platters made from glass.

1991 – IBM 0663 "Corsair" – 1,004 megabytes, eight 3.5-inch disks; first HDD using magnetoresistive heads

1991 – Intégral Peripherals 1820 "Mustang" – 21.4 megabytes, one 1.8-inch disk, first 1.8-inch HDD

1992 – HP Kittyhawk – 20 MB, first 1.3-inch hard-disk drive
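As a back-of-the-envelope check (my own arithmetic, not from the article), 2GB over three hours works out to a fairly modest data rate:

# 2 GB spread over a 3-hour tape:
bytes_total = 2 * 10**9
seconds = 3 * 60 * 60
print(bytes_total / seconds / 1000)        # ≈ 185 kB/s
print(bytes_total * 8 / seconds / 10**6)   # ≈ 1.48 Mbit/s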

A really interesting product...

A way to estimate the total number of YouTube videos

Saw 「How big is YouTube? (ethanzuckerman.com)」 on Hacker News Daily; the original article is 「How Big is YouTube?」.

It's an old problem, and the method should be on the simpler end of statistics. First, the authors' end result: 「TubeStats」.

The method the author uses is to look at YouTube's vid (video id):

Here’s how this works: YouTube URLs look like this: https://www.youtube.com/watch?v=vXPJVwwEmiM

From this you can work out that a vid encodes 64 bits of information, and to an engineer this data type looks very much like it should be uniformly distributed:

That bit after “watch?v=” is an 11 digit string. The first ten digits can be a-z,A-Z,0-9 and _-. The last digit is special, and can only be one of 16 values. Turns out there are 2^64 possible YouTube addresses, an enormous number: 18.4 quintillion. There are lots of YouTube videos, but not that many. Let’s guess for a moment that there are 1 billion YouTube videos – if you picked URLs at random, you’d only get a valid address roughly once every 18.4 billion tries.
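As a rough illustration of what "picking URLs at random" means here, a minimal Python sketch for generating candidate vids (the alphabet follows the quoted description; restricting the last character to the 16 base64url characters whose low two bits are zero is my inference from 11 × 6 = 66 > 64 bits, not something the article spells out):

import random
import string

# First ten characters: a-z, A-Z, 0-9, plus '-' and '_' (base64url).
BASE64URL = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
# The last character carries only 4 of its 6 bits, so only the 16
# alphabet characters whose low two bits are zero can appear there.
LAST = [BASE64URL[i] for i in range(0, 64, 4)]

def random_vid():
    return "".join(random.choices(BASE64URL, k=10)) + random.choice(LAST)

print(random_vid())  # an 11-character candidate like 'vXPJVwwEmiM'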

Then you simply generate random vids and probe them; this behaves a lot like drunk dialing, and is a form of random sampling:

We refer to this method as “drunk dialing”, as it’s basically as sophisticated as taking swigs from a bottle of bourbon and mashing digits on a telephone, hoping to find a human being to speak to. Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often. Kevin Zheng wrote a whole bunch of scripts to do the dialing, and over the course of several months, we collected more than 10,000 truly random YouTube videos.

Separately, a method using the autocomplete mechanism to do this was proposed back in 2011:

By comparing our results to other ways of generating lists of YouTube videos, we can declare them “plausibly random” if they generate similar results. Fortunately, one method does – it was discovered by Jia Zhou et. al. in 2011, and it’s far more efficient than our naïve method. (You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.) Kevin now polls YouTube using the “dash method” and uses the results to maintain our dashboard at Tubestats.

Their current estimate is around 13B videos, which translates to about 33.63 bits used (2^33.63):

In our case, our drunk dials tried roughly 32k numbers at the same time, and we got a “hit” every 50,000 times or so. Our current estimate for the size of YouTube is 13.325 billion videos – we are now updating this number every few weeks at tubestats.org.

And the part about one hit per 32768 * 50k tries corresponds to about 30.61 bits, so the two add up to roughly 64 bits, which checks out.
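A quick sanity check of those numbers (my own arithmetic, not from the article):

import math

tries_per_hit = 32_768 * 50_000          # one hit per ~1.6e9 attempts
estimate = 2**64 / tries_per_hit          # keyspace * hit rate
print(f"{estimate / 1e9:.1f}B videos")    # ≈ 11.3B, same order as their 13.325B
print(f"{math.log2(13.325e9):.2f} bits")          # ≈ 33.63
print(f"{math.log2(tries_per_hit):.2f} bits")     # ≈ 30.61, sums to ~64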

Note, though, that they don't give a confidence interval, so the true value could plausibly be off from 13B by about a factor of two in either direction (something like 6.5B~26B); it's better to treat the number as a ballpark...

Netflix releases viewing statistics for eighteen thousand titles from the first half of 2023

Saw this at 「What We Watched: A Netflix engagement report (netflix.com)」; Netflix's post is at 「What We Watched: A Netflix Engagement Report」, and the Excel report mentioned in the title is What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx.

Hacker News comment id=38621625 mentions that this is an outcome of the Writers Guild of America strike negotiations (see 「2023 Writers Guild of America strike」 on English Wikipedia or 「2023年美國編劇協會大罷工」 on Chinese Wikipedia):

This is an outcome of the WGA strike negotiations. Now writers (and actors, and anyone else) can use this information to better negotiate their worth with studios, rather than it being 1-sided. All other streaming services should be following suit soon.

The transparency part can be seen in 「What We Won」:

Streaming data transparency: Companies agree to provide the Guild, subject to a confidentiality agreement, the total number of hours streamed, both domestically and internationally, of self-produced high budget streaming programs (e.g., a Netflix original series). Aggregated information can be shared.

Opening the Excel file, you can see that Netflix released only the bare minimum of data, but as the comment says, this data is already enough to give many people a chance to turn around and negotiate better contracts.

On the other hand, you can probably expect that cross-referencing this public data with other metadata would turn up some interesting findings?
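As a starting point, a minimal pandas sketch for poking at the spreadsheet (the skiprows offset and the column names are my assumptions about the file's layout, not something documented in the article):

import pandas as pd

# skiprows and the column names below are assumptions about the
# spreadsheet's layout; adjust after inspecting the file.
df = pd.read_excel(
    "What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx",
    skiprows=5,
)
print(df.columns)  # inspect what Netflix actually named the columns

# Assuming a "Title" and an "Hours Viewed" column exist:
print(df.nlargest(10, "Hours Viewed")[["Title", "Hours Viewed"]])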

Twitch announces its exit from the South Korean market

Twitch announced it will exit the South Korean market on 2024/02/27 (a Tuesday): 「An Update on Twitch in Korea」. Not sure how the date was chosen; it may be tied to some contract?

Twitch's announcements currently come with a Traditional Chinese version too: 「Twitch 韓國現況更新」.

Also, looking around this morning, Hacker News has a discussion as well: 「An update on Twitch in Korea (twitch.tv)」.

The official reason given is that the operation loses money, and they couldn't find a way to overcome that:

Ultimately, the cost to operate Twitch in Korea is prohibitively expensive and we have spent significant effort working to reduce these costs so that we could find a way for the Twitch business to remain in Korea.

The measures mentioned include a p2p model and capping quality at 720p, but even so the network fees (presumably bandwidth costs) are more than ten times those of other regions:

First, we experimented with a peer-to-peer model for source quality. Then, we adjusted source quality to a maximum of 720p. While we have lowered costs from these efforts, our network fees in Korea are still 10 times more expensive than in most other countries. Twitch has been operating in Korea at a significant loss, and unfortunately there is no pathway forward for our business to run more sustainably in that country.

On the Cloudflare side, back in 2016 when it was still called CloudFlare, it complained about this too: 「CloudFlare 對 HiNet 成本的抱怨 (還有其他 ISP...)」.

Back then it described HiNet and KT (Korea Telecom) as costing about 15x the European/North American rate:

Two Asian locations stand out as being especially expensive: Seoul and Taipei. In these markets, with powerful incumbents (Korea Telecom and HiNet), transit costs 15x as much as in Europe or North America, or 150 units.

And for Korea in particular, government intervention made prices fall more slowly than in the rest of the world, so over time costs became much higher relative to other regions:

South Korea is perhaps the only country in the world where bandwidth costs are going up. This may be driven by new regulations from the Ministry of Science, ICT and Future Planning, which mandate the commercial terms of domestic interconnection, based on predetermined “Tiers” of participating networks. This is contrary to the model in most parts of the world, where networks self-regulate, and often peer without settlement. The government even prescribes the rate at which prices should decrease per year (-7.5%), which is significantly slower than the annual drop in unit bandwidth costs elsewhere in the world. We are only able to peer 2% of our traffic in South Korea.

That said, I'm not sure about the current situation; 2016-era CloudFlare and 2023's Cloudflare are seven years apart...

Recent AV1 support

The HN post 「AV1 video codec gains broader hardware support (fullystacked.net)」 says AV1 support has gotten broader; the original article is short, at 「The AV1 video codec gains broader hardware support」.

The AV1 video format page on Can I Use shows the picture more clearly.

In browsers it's still a long way from outright replacing the other video codecs, but it's a start: at least Safari on the iPhone 15 Pro and iPhone 15 Pro Max now supports it. Next is waiting for desktop Edge to remember to turn AV1 back on:

Edge has stopped supporting AV1 completely at some point prior to version 116 (additional information required).

Settings for blocking YouTube Shorts

I don't watch short-form videos at all, since they carry no real knowledge (no references to verify correctness against)... but I do browse regular videos on YouTube, hence this need.

YouTube Shorts shows up in a few places (for the record, I use the English interface):

  • On the left side of the home page, there is a Shorts link that leads into Shorts.
  • The recommendations on the home page also include a Shorts section.

These two cases are blocked with these two rules:

www.youtube.com##a[title="Shorts"]
www.youtube.com##ytd-rich-section-renderer

Note that the latter blocks not only Shorts but also YouTube's various promotions (movies and the like); I want those gone too, so I went straight for the ytd-rich-section-renderer element.

Next are the Shorts items interspersed within pages, such as the home page, the subscriptions page, and search result pages; for these you have to find the corresponding elements to block:

www.youtube.com##ytd-reel-shelf-renderer
www.youtube.com##ytd-video-renderer:has(a[href*="/shorts/"])

One more thing, unrelated to Shorts but still bad for focus: YouTube search results inject a pile of distracting recommendation shelves such as 「People also watched」, 「For you」, 「Previously watched」, and 「From related searches」, which can also be blocked:

www.youtube.com##:matches-path(/results) ytd-shelf-renderer[thumbnail-style]

That's about everything I'm using at the moment...

Raspberry Pi 5 drops hardware H.264 encoding

Saw 「Raspberry Pi 5 has no hardware video encoding and only HEVC decoding (raspberrypi.com)」 on HN; the link points to Gordon Hollingworth's reply, and it turns out this is last month's news.

Raspberry Pi boards have always had hardware H.264 encoding, but it was removed on the Raspberry Pi 5, so you have to encode in software.

Officially, because the Raspberry Pi 5's CPU is much faster than before, a single core can manage 1080p60 (and the RPi5 has four cores), so apart from power consumption it shouldn't be a big problem:

Obviously, the bad thing is the power consumption, but actually it only takes around 1 processor to encode 1080p60 with our default settings (which is still better quality than the PI 4 hardware encoder).

But speculation on HN is that it's a licensing-fee issue. I wonder whether the new chipset bundles the encoding licenses together, so H.264 + H.265 have to be bought as one package, and H.265's licensing fees are notoriously expensive...

For desktop use it should be fine, but it's good to know going in...

The technical restrictions on downloading YouTube videos, and how to get around them

Saw 「How They Bypass YouTube Video Download Throttling」 on Hacker News, which walks through the various ways YouTube prevents downloading.

Fetching directly from the URL returned by the API is very slow, around 40-70KB/sec:

However, attempting to download from this URL leads to really slow download:

The speed is always limited to around 40-70kB/s.

Here you need a JavaScript environment to compute n, which is then included in subsequent requests to "prove" that you're the official web client:

Since mid-2021, YouTube has included the query parameter n in the majority of file URLs. This parameter needs to be transformed using a JavaScript algorithm located in the file base.js, which is distributed with the web page. YouTube utilizes this parameter as a challenge to verify that the download originates from an “official” client. If the challenge is not resolved and n is not transformed correctly, YouTube will silently apply throttling to the video download.

The JavaScript algorithm is obfuscated and changes frequently, so it’s not practical to attempt reverse engineering to understand it. The solution is simply to download the JavaScript file, extract the algorithm code, and execute it by passing the n parameter to it. The following code accomplishes this.

But even with n computed correctly, downloads are still throttled; the author measured about 4MB/sec. That's much faster than before, but the speed limit is still visible. This is mainly to keep clients from buffering too far ahead and wasting bandwidth:

With this new URL containing the correctly transformed n parameter, the next step is to download the video. However, YouTube still enforces a throttling rule. This rule imposes a variable download speed limit based on the size and length of the video, aiming to provide a download time that’s approximately half the duration of the video. This aligns with the streaming nature of videos. It would be a massive waste of bandwidth for YouTube to always provide the media file as quickly as possible.

The next trick is to use Range headers to split the download into many HTTP requests. Since the buffering algorithm pushes data at full speed before the throttling kicks in, each small request can ride that initial burst, sidestepping the speed limit.
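A minimal Python sketch of the Range-splitting idea (the URL, chunk size, and file name here are placeholders; the real flow in the article also computes the transformed n parameter first):

import requests

# Hypothetical googlevideo URL that already contains the transformed
# "n" parameter; in practice this comes from the player response.
url = "https://example.googlevideo.com/videoplayback?...&n=TRANSFORMED"
chunk_size = 10 * 1024 * 1024  # 10 MB per request (arbitrary choice)

def fetch_range(url, start, end):
    # Each small ranged request finishes while the server is still in
    # its initial full-speed burst, before throttling kicks in.
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=30)
    r.raise_for_status()
    return r.content

# Get the total size from the Content-Range header of a 1-byte probe.
probe = requests.get(url, headers={"Range": "bytes=0-0"}, timeout=30)
total = int(probe.headers["Content-Range"].split("/")[1])

with open("video.mp4", "wb") as f:
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size - 1, total - 1)
        f.write(fetch_range(url, start, end))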

With the extra requests and processing time factored in, the overall rate reaches about 50-70MB/sec, an acceptable download speed:

However, the average speeds typically ranged between 50-70 MB/s or 400-560 Mb/s, which is still pretty fast.

The article ends with some commands for merging the results (since YouTube serves video and audio as two separate files), which isn't the main point...

Cloud GPU costs, plus a decision chart for on-prem GPUs

Saw 「Cloud GPU Resources and Pricing (fullstackdeeplearning.com)」 on Hacker News; the original page is 「Cloud GPUs - The Full Stack」, which has some useful resources worth pulling out on their own.

On the cloud side, since the H100 doesn't seem widely available yet, the comparison uses the previous-generation A100 (80GB), and you can see the gap between the big clouds and the other providers is still quite large.

Though it looks like vast.ai wasn't included here.

The on-prem information is mainly about which card to pick when buying a GPU outright. Besides the flagship card of each series (4090 & 3090 & 2080), the 3060 makes the list on the "cheap" axis, presumably because it's an entry-level card that still comes with 12GB of VRAM.

The 4060 coming this July will have a 16GB VRAM version, which should take over the current position of the 3060 12GB...