How to estimate the total number of YouTube videos

Saw "How big is YouTube? (ethanzuckerman.com)" on Hacker News Daily; the original post is "How Big is YouTube?".

It's an old question, and the method involved should be fairly simple statistics. To start, here is the author's end result: "TubeStats".

The author's approach is to look at YouTube's vid:

Here’s how this works: YouTube URLs look like this: https://www.youtube.com/watch?v=vXPJVwwEmiM

From this you can work out that a vid carries 64 bits of information, and to an engineer this data type looks very much like it should be uniformly distributed:

That bit after “watch?v=” is an 11 digit string. The first ten digits can be a-z,A-Z,0-9 and _-. The last digit is special, and can only be one of 16 values. Turns out there are 2^64 possible YouTube addresses, an enormous number: 18.4 quintillion. There are lots of YouTube videos, but not that many. Let’s guess for a moment that there are 1 billion YouTube videos – if you picked URLs at random, you’d only get a valid address roughly once every 18.4 billion tries.

The next step is to randomly generate vids and probe them; this behaves a lot like drunk dialing, and amounts to random sampling:

We refer to this method as “drunk dialing”, as it’s basically as sophisticated as taking swigs from a bottle of bourbon and mashing digits on a telephone, hoping to find a human being to speak to. Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often. Kevin Zheng wrote a whole bunch of scripts to do the dialing, and over the course of several months, we collected more than 10,000 truly random YouTube videos.
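
To make the "drunk dialing" concrete, here is a minimal sketch that generates one random candidate vid in the format quoted above; the exact 16-value set allowed for the last character isn't spelled out in the article, so the subset below is only an assumption for illustration:

import secrets
import string

# First ten characters: the 64-symbol URL-safe alphabet described above.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
# Last character: only 16 possible values; which 16 is an assumption here,
# taken as every 4th symbol of the alphabet purely for illustration.
LAST_CHARS = ALPHABET[::4]

def random_video_id() -> str:
    """Generate one uniformly random candidate vid (one "drunk dial")."""
    head = "".join(secrets.choice(ALPHABET) for _ in range(10))
    return head + secrets.choice(LAST_CHARS)

print(random_video_id())  # same shape as vXPJVwwEmiM, but almost surely not a real video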

There is also a method proposed back in 2011 that uses the autocomplete mechanism to estimate the count:

By comparing our results to other ways of generating lists of YouTube videos, we can declare them “plausibly random” if they generate similar results. Fortunately, one method does – it was discovered by Jia Zhou et. al. in 2011, and it’s far more efficient than our naïve method. (You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.) Kevin now polls YouTube using the “dash method” and uses the results to maintain our dashboard at Tubestats.

Their current estimate is roughly 13B videos, which works out to about 33.63 bits (2^33.6):

In our case, our drunk dials tried roughly 32k numbers at the same time, and we got a “hit” every 50,000 times or so. Our current estimate for the size of YouTube is 13.325 billion videos – we are now updating this number every few weeks at tubestats.org.

And the part about one hit per 32768 * 50k tries comes out to about 30.61 bits, so the two do add up to roughly 64 bits.
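
As a quick sanity check on that arithmetic, here is a minimal sketch using the rounded numbers quoted above; note it lands around 11 billion rather than exactly on their published 13.325 billion, which is the same order of magnitude:

import math

ADDRESS_SPACE = 2 ** 64  # all possible vid values, per the article

# Rounded figures quoted above: each "dial" probes ~32k ids at once,
# and a hit comes back roughly once every 50,000 dials.
IDS_PER_DIAL = 32_768
DIALS_PER_HIT = 50_000

hit_rate = 1 / (IDS_PER_DIAL * DIALS_PER_HIT)
estimate = ADDRESS_SPACE * hit_rate

print(f"tries per hit ~ 2^{math.log2(IDS_PER_DIAL * DIALS_PER_HIT):.2f}")  # ~ 2^30.61
print(f"estimate ~ {estimate / 1e9:.1f} billion videos (~ 2^{math.log2(estimate):.2f})")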

Note, however, that they don't give a confidence interval, so the 13B figure could easily be off by a factor of two in either direction (something like 6.5B~26B); it's best to treat the number as a rough idea...

Settings for blocking YouTube Shorts

I don't watch short-form videos at all, since they carry no real knowledge (there are no references to verify correctness)... but I do browse regular videos on YouTube, hence this need.

YouTube Shorts show up in a few places (note that I use the English interface):

  • On the left side of the home page, there is a Shorts link you can click to browse Shorts.
  • The home page recommendations also contain a Shorts section.

These two cases are blocked with these two rules:

www.youtube.com##a[title="Shorts"]
www.youtube.com##ytd-rich-section-renderer

Note that the second rule blocks not only Shorts but also all sorts of YouTube promotions (movies and the like); since I want those gone as well, I simply target the ytd-rich-section-renderer element.

Next are the Shorts interspersed throughout pages like the home page, the subscriptions page, and the search results page; for these you need to find the corresponding elements to block:

www.youtube.com##ytd-reel-shelf-renderer
www.youtube.com##ytd-video-renderer:has(a[href*="/shorts/"])

Another thing, unrelated to Shorts but still a big drain on focus, is that YouTube's search results throw in a pile of distracting recommendations such as "People also watched", "For you", "Previously watched", and "From related searches"; these can be blocked as well:

www.youtube.com##:matches-path(/results) ytd-shelf-renderer[thumbnail-style]

That's roughly what I'm using at the moment...

YouTube's recent crackdown on ad blockers

Lately, anyone using uBlock Origin or other ad blockers has more or less run into YouTube blocking them. I then saw "Youtube’s Anti-adblock and uBlock Origin" on HN; the corresponding discussion is at "YouTube's Anti-Adblock and uBlock Origin (andadinosaur.com)".

For the current workaround, see the write-up "YouTube Anti-Adblock and Ads - October 16, 2023 (Weekly Thread)"; the current advice is to update your block lists every day after you wake up so that they stay current.

You don't need to update every list, only the "uBlock filters – Quick fixes" set: click its clock icon, then press Update Now at the top to refresh it (I'm on the English interface; the Chinese interface presumably shows Chinese labels...):

It's basically a game of cat and mouse, so updates will be very frequent for a while.

Another extension I started using recently and also recommend is "SponsorBlock for YouTube - Skip Sponsorships", which skips sponsored segments.

Technical limits on downloading YouTube videos and how to get around them

Saw "How They Bypass YouTube Video Download Throttling" on Hacker News, which walks through the various ways YouTube discourages downloading.

Downloading directly from the URL obtained through the API is very slow, around 40-70KB/sec:

However, attempting to download from this URL leads to really slow download:

The speed is always limited to around 40-70kB/s.

You need a JavaScript environment to compute n and pass it along in subsequent requests to "prove" you are the official web client:

Since mid-2021, YouTube has included the query parameter n in the majority of file URLs. This parameter needs to be transformed using a JavaScript algorithm located in the file base.js, which is distributed with the web page. YouTube utilizes this parameter as a challenge to verify that the download originates from an “official” client. If the challenge is not resolved and n is not transformed correctly, YouTube will silently apply throttling to the video download.

The JavaScript algorithm is obfuscated and changes frequently, so it’s not practical to attempt reverse engineering to understand it. The solution is simply to download the JavaScript file, extract the algorithm code, and execute it by passing the n parameter to it. The following code accomplishes this.

But even with n computed, there is still a speed cap; the author measured around 4MB/sec, which is much faster than before but clearly still throttled. This is mainly to keep clients from over-buffering and wasting bandwidth:

With this new URL containing the correctly transformed n parameter, the next step is to download the video. However, YouTube still enforces a throttling rule. This rule imposes a variable download speed limit based on the size and length of the video, aiming to provide a download time that’s approximately half the duration of the video. This aligns with the streaming nature of videos. It would be a massive waste of bandwidth for YouTube to always provide the media file as quickly as possible.

The next trick is to use Range headers to split the download into many HTTP requests; since the buffering algorithm serves data at full speed before throttling kicks in, each small request dodges the limit.
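
Here is a minimal sketch of that Range-splitting idea in Python with the requests library; it is a generic illustration rather than the author's code, and media_url / total_size stand in for the googlevideo URL (with the transformed n already applied) and its Content-Length:

import requests

def download_in_chunks(url: str, total_size: int, chunk_size: int = 10 * 1024 * 1024) -> bytes:
    """Fetch a file as many small Range requests instead of one long transfer.

    Each request finishes inside the initial full-speed buffering window,
    so the per-connection throttle described above never kicks in.
    """
    parts = []
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size - 1, total_size - 1)
        resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=30)
        resp.raise_for_status()
        parts.append(resp.content)
    return b"".join(parts)

# Hypothetical usage:
# data = download_in_chunks(media_url, total_size)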

After accounting for the extra requests and processing time, the overall speed reaches roughly 50-70MB/sec, which is an acceptable download speed:

However, the average speeds typically ranged between 50-70 MB/s or 400-560 Mb/s, which is still pretty fast.

The rest covers commands for merging the streams (YouTube separates video and audio into two files), which isn't the main point...

Using YouTube videos as free file hosting

Saw the discussion "Infinite-Storage-Glitch – Use YouTube as cloud storage for any files (github.com/dvorakdwarf)" on Hacker News; the project itself is at DvorakDwarf/Infinite-Storage-Glitch.

This trick is a bit like the old Love machine stunt; as I recall, 和信 (KG Telecom) played it back then and saturated the bandwidth between HiNet and TWIX, but that's an old story.

It looks like the approach stores data as 2x2 black-and-white pixel blocks and then lets YouTube compress it; the mp4 that YouTube ends up producing is roughly four times the size of the original file:

Use the embed option on the archive (THE VIDEO WILL BE SEVERAL TIMES LARGER THAN THE FILE, 4x in case of optimal compression resistance preset)
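
As a rough illustration of the 2x2-block idea (not the project's actual implementation; the frame size and layout below are assumptions), each bit is spread over a small block of identical pixels so that lossy video compression is unlikely to flip it:

import numpy as np

def bits_to_frame(bits: np.ndarray, width: int = 1280, height: int = 720, block: int = 2) -> np.ndarray:
    """Map a bit stream onto one video frame of 2x2 black/white blocks."""
    cols, rows = width // block, height // block
    assert bits.size <= cols * rows, "payload too large for a single frame"
    padded = np.zeros(cols * rows, dtype=np.uint8)
    padded[: bits.size] = bits
    grid = padded.reshape(rows, cols) * 255  # 0 -> black pixel, 1 -> white pixel
    # Expand every grid cell into a block x block patch of identical pixels.
    return np.kron(grid, np.ones((block, block), dtype=np.uint8))

# Hypothetical usage: payload_bits would come from unpacking the file's bytes.
# frame = bits_to_frame(payload_bits)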

For a more academic take, there is the topic of Information hiding, which is about how to "hide" information inside a media carrier (though this project isn't even trying to be covert).

That said, there are plenty of dedicated file-hosting services nowadays, so this approach is mostly for fun; its "practicality" isn't what it used to be.

ARC (Authenticated Received Chain)

The ARC in the title is Authenticated Received Chain, something I only learned about a while back from "Gmail accepts forged YouTube emails (john-millikin.com)" on Hacker News; the original article is "Gmail accepts forged YouTube emails".

The author found that Gmail accepted mail that was not sent directly from YouTube:

The main reason is that besides the standard SPF and DKIM checks, Gmail also honors the ARC mentioned above.

Looking into ARC: the standard is RFC 8617, still marked as experimental, and its main selling point is solving the forwarding problem; the authors are from LinkedIn (Microsoft), Google, and Valimail.

ARC has the same smell as AMP, which Google pushed hard earlier (and got fined over), and Signed HTTP Exchanges, which it is pushing now: things that brush aside security & privacy concerns...

Google's services in the EU will offer a Reject All Cookies button

Saw "Google gives Europe a ‘reject all’ button for tracking cookies after fines from watchdogs", about Google's EU services starting to offer a Reject All Cookies button; Google's official announcement is at "New cookie choices in Europe".

The Reject All Cookies button is designed like this:

According to the report, France fined Google €150 million at the start of this year because Accept All Cookies took a single click while Reject All Cookies required clicking through several menus, and France held that this asymmetric design was unlawful:

Earlier this year, France’s data protection agency CNIL fined Google €150 million ($170 million) for deploying confusing language in cookie banners. Previously, Google allowed users to accept all tracking cookies with a single click, but forced people to click through various menus to reject them all. This asymmetry was unlawful, said CNIL, steering users into accepting cookies to the ultimate benefit of Google’s advertising business.

Google's announcement also mentions France, though of course not the fine:

Based on these conversations and specific direction from France’s Commission Nationale de l’Informatique et des Libertés (CNIL), we have now completed a full redesign of our approach, including changes to the infrastructure we use to handle cookies.

Also, the feature is currently only enabled in France, and it will be rolled out across the rest of Europe later:

We’ve kicked off the launch in France and will be extending this experience across the rest of the European Economic Area, the U.K. and Switzerland.

How Americans use social media

Saw this in "Social Media Usage by Age", which charts Americans' social media usage; the data comes from Pew Research Center's "Social Media Fact Sheet".

It's plain to see that Google (Alphabet) dominates with essentially a single product, YouTube, while Facebook (Meta) has three products with deep penetration: Facebook, Instagram, and WhatsApp.

LinkedIn picks up once people enter the workforce; it was also quite a surprise to see how many older folks use Pinterest XDDD

A community-maintained package for YouTube's private API

Also from today's Hacker News Daily: a package that drives YouTube through its private API: "Youtube.js – full-featured wrapper around YouTube's private API (github.com/luanrt)".

These private APIs are the ones YouTube itself uses on its site:

A full-featured wrapper around the Innertube API, which is what YouTube itself uses.

And since it's not a public API, there's no need to apply for a key:

Do I need an API key to use this?

No, YouTube.js does not use any official API so no API keys are required.

You can of course expect it to break without warning, so weigh that yourself when deciding how to use it...

What's more interesting is that in the Hacker News discussion, someone instead asked how to detect this kind of library or bot XDDD

If you’re YouTube or any site, and want to stop these sort of wrappers - what’s the easiest way to do so without breaking your own site?

I find this task to be an interesting engineering problem.

A related question is if there’s an unspoofable way to detect a client.

Skimming the thread, though, there didn't seem to be much there...

Using custom uBlock Origin filters to strip out YouTube Shorts

First, remove the Shorts links from the YouTube subscriptions page:

www.youtube.com##:matches-path(/feed/subscriptions) ytd-grid-video-renderer:has(a[href^="/shorts"])

Then remove the Shorts category link on the left side of the home page:

www.youtube.com##ytd-mini-guide-entry-renderer:has(a[title="Shorts"])

I have to say :has() is really handy...