Getting sound out of Windows inside virt-manager

I had been using it for a few months before noticing there was no sound, and it took a while of searching and testing to fix.

Most people on the defaults should be fine, but my environment is a bit unusual and happens not to use them: my PulseAudio output to the DA&T C-13 is set to 96kHz/24bit, while Windows outputs 44.1kHz/16bit by default, and with the two sides mismatched there was simply no sound.

After digging through some articles, I found the answer in "Virtual machine audio setup – or how to get pulse audio working": use virsh to edit the XML configuration directly so that qemu outputs into PulseAudio, which then does the conversion. That's the Option 2 part of the article.

With this setup the guest OS still runs at 44.1/16, but the output on the host side is fixed at 96/24.
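The change itself is just a bit of XML in the domain definition, edited via virsh edit. Here is a minimal sketch of the idea, assuming a libvirt new enough to support the <audio> element (not necessarily the article's exact Option 2 XML); the serverName socket path depends on your uid:

    <!-- inside <devices>: attach the emulated sound card to a
         PulseAudio backend, so the host's PulseAudio handles resampling -->
    <sound model='ich9'>
      <audio id='1'/>
    </sound>
    <audio id='1' type='pulseaudio' serverName='/run/user/1000/pulse/native'/>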

Oddly enough, if the guest runs at 96/16 the sound comes out crackly, but I just want sound at all (otherwise I wouldn't have gone months without noticing), so quality is not a big concern...

Fabrice Bellard's new project, the TSAC codec

Yesterday on Hacker News I saw "TSAC: Low Bitrate Audio Compression (bellard.org)", a new project from Fabrice Bellard: "TSAC: Very Low Bitrate Audio Compression".

Fabrice Bellard has been playing with a lot of ML-related things these past couple of years, like 2021's "LibNC: C Library for Tensor Manipulation" and 2023's "ts_zip: Text Compression using Large Language Models"; this time he's using a transformer:

tsac is based on a modified version of the Descript Audio Codec extended for stereo and a Transformer model to further increase the compression ratio. Both models are quantized to 8 bits per parameter.

I looked into the Descript Audio Codec, which is already quite impressive on its own: at 44.1KHz mono it gets down to 8kbps (~1KB/sec):

With Descript Audio Codec, you can compress 44.1 KHz audio into discrete codes at a low 8 kbps bitrate.

TSAC pushes the mono version down to 5.5kbps, and 7.5kbps is enough for stereo, still below the original 8kbps:

TSAC is an audio compression utility reaching very low bitrates such as 5.5 kb/s for mono or 7.5 kb/s for stereo at 44.1 kHz with a good perceptual quality. Hence TSAC compresses a 3.5 minute stereo song to a file of 192 KiB.
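The quoted file size checks out; a quick sanity check in TypeScript (just arithmetic):

    // 7.5 kb/s stereo over a 3.5-minute song, expressed in KiB
    const bits = 7.5 * 1000 * 3.5 * 60; // 1,575,000 bits
    console.log((bits / 8 / 1024).toFixed(1)); // ≈ 192.3, matching the quoted 192 KiB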

Listening to the samples, it's quite impressive; even compressing full music directly still gives decent results?

From the technical constraints you can tell the scenario: both ends need enough compute (you have to run the ML algorithm), while communicating over a severely bandwidth-starved link? Commercial use cases seem a bit scarce, now that even deserts and deep mountains have satellite coverage, but there may still be situations where it's useful; LoRa speeds, for example, fall roughly in this range.

Military applications, on the other hand, have to account for plenty of extreme conditions, so the chances of it being used there are probably higher?

Using the Web Audio API as a fingerprint

"How the Web Audio API is used for audio fingerprinting", an article from three years ago, explains how AudioContext gets used for fingerprinting; I recently came across it via "How We Bypassed Safari 17's Advanced Audio Fingerprinting Protection".

AudioContext can operate completely independently of any recording device, as pure computation, and because browsers differ in their implementations, it gets used as a fingerprint.

The method described in the article uses an Oscillator to generate a 440Hz sine wave, then runs it through a Compressor to lower the volume (as a computation):

The Web Audio API provides a DynamicsCompressorNode, which lowers the volume of the loudest parts of the signal and helps prevent distortion or clipping.
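A minimal sketch of that pipeline in TypeScript (illustrative only, not the article's exact code; the render length and the way samples get reduced to one number vary between fingerprinting libraries):

    // Render a 440Hz sine through a DynamicsCompressorNode offline,
    // then collapse the rendered samples into a single number.
    async function audioFingerprint(): Promise<number> {
      // OfflineAudioContext: pure computation, no audible output,
      // no recording device or permission involved
      const ctx = new OfflineAudioContext(1, 44100 * 5, 44100);
      const osc = ctx.createOscillator();
      osc.type = "sine";
      osc.frequency.value = 440;
      const compressor = ctx.createDynamicsCompressor();
      osc.connect(compressor);
      compressor.connect(ctx.destination);
      osc.start();
      const buffer = await ctx.startRendering();
      const samples = buffer.getChannelData(0);
      // Tiny implementation and floating-point differences accumulate
      // here into a browser- (and version-) specific value
      let sum = 0;
      for (const s of samples) sum += Math.abs(s);
      return sum;
    }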

This volume-lowering computation is where each implementation differs, which is enough to distinguish browsers (and even versions):

Historically, all major browser engines (Blink, WebKit, and Gecko) based their Web Audio API implementations on code originally developed by Google in 2011 and 2012 for the WebKit project.

Since then browser developers have made a lot of small changes. These changes, compounded by the large number of mathematical operations involved, lead to fingerprinting differences. Audio signal processing uses floating point arithmetic, which also contributes to discrepancies in calculations.

Additionally, browsers use different implementations for different CPU architectures and OSes to leverage features like SIMD. For example, Chrome uses a separate fast Fourier transform implementation on macOS (producing a different oscillator signal) and other vector operation implementations on different CPU architectures (used in the DynamicsCompressor implementation). These platform-specific changes also contribute to differences in the final audio fingerprint.

And since this is rarely needed in everyday use, a privacy-focused browser like Tor Browser simply disables it outright:

Tor

In the case of the Tor browser, everything is simple. But unfortunately, web Audio API is disabled there, so audio fingerprinting is impossible.

All the weird variations in song and album names

Saw "Horrible edge cases to consider when dealing with music (2022) (dustri.org)" on Hacker News, about all kinds of strange cases in the music industry. The 2022 original is here: "Horrible edge cases to consider when dealing with music"; a large part of it is about naming problems...

This reminded me of this tweet:

Basically you run into all kinds of escaping and UTF-8 handling. The article also mentions extremely long names, the natural enemy of VARCHAR(255)... I'm not sure whether, back when Company K was licensing music in bulk, they ever ran into this 1999 album (what follows is the name of a single album):

When the Pawn Hits the Conflicts He Thinks Like a King What He Knows Throws the Blows When He Goes to the Fight and He'll Win the Whole Thing 'Fore He Enters the Ring There's No Body to Batter When Your Mind Is Your Might So When You Go Solo, You Hold Your Own Hand and Remember That Depth Is the Greatest of Heights and If You Know Where You Stand, Then You Know Where to Land and If You Fall It Won't Matter, Cuz You'll Know That You're Right.

The other scenarios in the article should count as nothing too surprising, like a single track running seven hours... That's still manageable on the server side, but it's rough on mobile platforms: loading the whole track into memory at once can blow up.

Seven hours at 320kbps is roughly 1GB; if you're not careful and run DRM processing on it directly, you instantly eat 2GB of RAM, and at CD-quality data rates it gets even more obvious.

And then you realize: if you design the system treating any data from music creators as untrusted input (except you can't reject it), the problems usually stay manageable XD
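As a trivial sketch of that attitude (a hypothetical function, made up for illustration): instead of assuming titles fit the schema, measure them and route oversized ones somewhere that can actually hold them:

    // Hypothetical guard: album/track titles can exceed VARCHAR(255)
    function fitsVarchar255(title: string): boolean {
      // some schemas count characters, others count bytes; check both
      const bytes = new TextEncoder().encode(title).length;
      return title.length <= 255 && bytes <= 255;
    }
    // The 1999 album title above fails this check by a wide margin.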

The story of upgrading a Sennheiser HD 555 into an HD 595

Seen on Hacker News: how to upgrade a Sennheiser HD 555 into an HD 595 with nothing but a screwdriver: "sennheiser hd 555 to hd 595 mod".

This page will show you how to turn a $199.95 (Canadian – Suggested Retail) pair of Sennheiser HD 555 headphones into a pair of Sennheiser HD 595’s that cost $349.95. And all you need is a screwdriver.

The only difference between the two is an extra piece of foam in the HD 555; just take it out:

Aside from the aesthetic differences, the only physical difference was an additional piece of foam inside the cheaper HD555 headphones, blocking about 50% of the outside-facing vents. Since both the HD 555 and HD 595 are designed to be “open” headphones, reducing the vent with this foam would alter the frequency response slightly. So to save yourself $150, open your HD 555’s up and remove the foam. Done.

The author says the difference is noticeable:

Yes. The actual sound difference is very slight, but it is noticeable.

In the Hacker News discussion "Sennheiser HD 555 to HD 595 Mod (mikebeauchamp.com)" people speculate about the reason: it could be artificial product-line segmentation, or lower-binned units being repurposed, but it does look like the two units themselves are identical...

That said, both are old models now and appear to be out of production.

Riffusion: generating music straight from a prompt

The wildly popular Stable Diffusion takes a string of text (a prompt) and produces an image; Riffusion takes a string of text and produces music.

Prompt-to-music itself was within expectations (i.e. it was bound to show up sooner or later), but the project page explains that Riffusion is built on Stable Diffusion, and that it uses Stable Diffusion to generate spectrograms:

Well, we fine-tuned the model to generate images of spectrograms, like this:

That is, an image like this:

The Hacker News discussion page is worth a read, and the authors took part in some of the threads: "Riffusion – Stable Diffusion fine-tuned to generate music (riffusion.com)".

Someone there noted that this approach is beyond what you'd imagine possible, because if the output image is off by just a few pixels you get a very different sound:

This really is unreasonably effective. Spectrograms are a lot less forgiving of minor errors than a painting. Move a brush stroke up or down a few pixels, you probably won't notice. Move a spectral element up or down a bit and you have a completely different sound. I don't understand how this can possibly be precise enough to generate anything close to a cohesive output.

Absolutely blows my mind.

One of the authors replied that they too were surprised to find it actually worked once they tried:

Author here: We were blown away too. This project started with a question in our minds about whether it was even possible for the stable diffusion model architecture to output something with the level of fidelity needed for the resulting audio to sound reasonable.

Actually listening to the generated output, it really is passable music... Nobody expected you could pull it off this way, and the upvote count on Hacker News exploded XD

Framework laptops hit by parts shortages too, swapping the audio chip

Framework's laptops have been popular in the community lately; the modular design makes repairs easy and leaves lots of room for spec customization. The "Marketplace" page lists plenty of swappable parts: besides the usual wireless card, SSD, and RAM, even the mainboard, keyboard, and the USB and HDMI ports are modules.

But the point here is that the audio chip also got hit by this wave of supply chain problems: "Solving for Silicon Shortages"; the Hacker News discussion "Framework: Solving for Silicon Shortages (frame.work)" is also worth a look.

From the article it looks like lead times for the Realtek ALC295 blew up:

Chips that would normally have 16-20 week lead times (meaning we’d place typically binding orders that far ahead of needing parts in our hands) went up to 52 weeks. In one case, we even got notified of a 68 week lead time on a chip!

We were able to get enough Realtek ALC295 audio CODECs to develop the Framework Laptop and get through the first few months of production, but nowhere near enough to fulfill ongoing demand from the US and Canada, let alone the additional countries we’d like to ship to.

So they decided to switch to the Tempo 92HD95B:

Luckily, we were able to find an alternative CODEC that lets us stay in production: the Tempo 92HD95B.

Checking the datasheets, the original Realtek ALC295 is a QFN-48 while the Tempo 92HD95B is a QFN-40, so quite a bit must have needed changing... They probably couldn't even find stock on the open market and were forced into a redesign, much like the situation at my own company; it seems everyone has been crying their eyes out lately :o

Google's newly released Lyra audio codec

Saw "Lyra audio codec enables high-quality voice calls at 3 kbps bitrate" on Hacker News Daily, about Google's newly released Lyra audio codec: "Lyra: A New Very Low-Bitrate Codec for Speech Compression"; the paper is available as "Generative Speech Coding with Predictive Variance Regularization".

Google's current idea is to make still-usable video calls work within 56kbps of bandwidth:

Pairing Lyra with new video compression technologies, like AV1, will allow video chats to take place, even for users connecting to the internet via a 56kbps dial-in modem.

The breakthrough this time is transmitting over just 3kbps of bandwidth while sounding noticeably clearer than Opus at 6kbps.

Google provides two samples in the article, one with a clean background and one with a noisy background, and both are much better than Opus and Speex.

The paper says it doesn't require much computing power, but I couldn't find source code on GitHub or anywhere, so treat this as reference material for now:

We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.

Zoom's watermarking features

Saw on Hacker News Daily that The Intercept covered Zoom's watermarking features, and how to protect yourself if you plan to leak: "What You Should Know Before Leaking a Zoom Meeting". The article isn't about the previously discussed Zoom issues; it focuses specifically on the watermarking features.

Zoom supports both video watermarks and audio watermarks:

Going by the descriptions, neither of the two mechanisms seems hard to defeat, but the main point is to remind journalists: if you publish Zoom audio or video provided by a source, watch out for watermarks that could leak the source's identity:

Journalists should also be wary of publishing raw audio leaked from Zoom meetings, particularly if the source is not sure whether audio watermarking was enabled or not.

A quick search on GitHub turned up no tools for handling this, so that may have to wait for someone to build one...

Amazon Transcribe can now take other formats

Amazon Transcribe is AWS's speech-to-text service. It previously only accepted WAV, FLAC, MP3, and MP4; now it supports quite a few more formats:

Today, we are excited to announce native support for media files in AMR, AMR-WB, Ogg and WebM format by Amazon Transcribe.

AMR and AMR-WB used to be fairly common but are rarer these days, probably because of patents plus the growing number of alternatives shrinking their usage.

Then there are Ogg and WebM, both open formats.

Last time I tested Amazon Transcribe on a Japanese video, I first used FFmpeg to extract the audio track from the MP4 file and uploaded that for transcription, then used andyhopp/aws-transcribe-to-srt to convert Amazon Transcribe's JSON output into an SRT file. Accuracy-wise it was usable, though proper nouns (like people's names) need separate handling; still a lot better than nothing...
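For reference, the transcription step looks roughly like this with the AWS SDK v3 for JavaScript (a sketch only; the region, bucket, file, and job names are made up, and the audio is assumed to be already extracted with FFmpeg and uploaded to S3):

    import {
      TranscribeClient,
      StartTranscriptionJobCommand,
    } from "@aws-sdk/client-transcribe";

    const client = new TranscribeClient({ region: "ap-northeast-1" });

    // Kick off an async transcription job on a file already in S3;
    // the JSON result can then be fed to aws-transcribe-to-srt.
    await client.send(
      new StartTranscriptionJobCommand({
        TranscriptionJobName: "jp-video-test",               // hypothetical
        LanguageCode: "ja-JP",
        MediaFormat: "ogg",                                  // one of the newly supported formats
        Media: { MediaFileUri: "s3://my-bucket/audio.ogg" }, // hypothetical
      })
    );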