Google 提出的 GoogleSQL (Pipe 版本的 SQL 改良)

看到「SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL」這個研究投稿,PDF 檔案在 1004848.pdf 這邊。

Google 提出了 GoogleSQL 改善本來 SQL 的可讀性問題,另外也對 SQL optimizer 更有幫助。

直接拿 PDF 裡面的例子來說明,把本來是這樣的 SQL:

SELECT c_count, COUNT(*) AS custdist
FROM
  ( SELECT c_custkey, COUNT(o_orderkey) c_count
    FROM customer
    LEFT OUTER JOIN orders ON c_custkey = o_custkey
         AND o_comment NOT LIKE '%unusual%packages%'
    GROUP BY c_custkey
  ) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;

變成這樣的:

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;

可以看到這個語法除了變好讀以外,也指示了 SQL optimizer 怎麼去過濾與組合資料。

依照 Google 的說明,GoogleSQL 已經在許多 Google 的系統上實作了:

We’ve implemented pipe syntax in GoogleSQL, the SQL dialect and implementation shared across all1 SQL systems at Google including F1, BigQuery, Spanner and Procella, and the open-source release, ZetaSQL. GoogleSQL is a shared, reusable component, enabling many systems to share the same SQL dialect. This shared component allowed implementing pipe syntax in one place and then enabling it across many products.

裡面提到的 ZetaSQL 可以在 pipe-syntax.md 這邊看到。

我記得其他家也有類似的東西,PRQL 也是在解決類似的問題,不過語法走向不太一樣。

可以看看怎麼收斂...

goo.gl 將在 2025/08/25 停止轉址

堆了一陣子的消息,Google 將在 2025/08/25 停止 goo.gl 轉址服務:「Google URL Shortener links will no longer be available」。

2019 就已經停止新的網址了:「Transitioning Google URL Shortener to Firebase Dynamic Links」,要維持這個服務對 Google 來說不會花太多力氣 (以 Google 的規模來說),但因為這對於賺錢 (尤其是 Ad) 沒有直接與間接幫助,還是被砍掉了...

看了一下 ArchiveTeam 的資料,應該是 URLTeam 這個計畫,看起來 2019 年就有人拉了一包了。

不過好像不太好找,他們應該是丟到 Internet Archive 上,不過 Internet Archive 上面找 goo-gl 看起來有點散?不確定丟上去的邏輯是什麼...

用 Google Calendar 當作範例,設計資料庫的結構

Hacker News 上看到「Database Design for Google Calendar: A Tutorial (databasedesignbook.com)」這個,原文在「Database Design fo Google Calendar: a tutorial」這,這邊的資料庫指的是關聯式資料庫 (RDBMS),像是常見的 MySQL 或是 PostgreSQL

而有趣的點是在 Hacker News 的討論,不過還是先帶一下原文的內容。

原文主要就是怎麼設計以及怎麼下 SQL query 會比較好,另外有些「作者經驗」說明為什麼這樣設計,不過這部份很吃場景,所以當作故事聽過去比較好,自己遇到的時候需要通盤考慮;另外有些地方提到的是 trade off,沒有最佳解。

然後回到 Hacker News 上面,在 id=41044011 這邊算給你看,幹嘛用 RDBMS 設計一堆架構:

A random event from my calendar serializes to 740 bytes of iCalendar. An extremely busy calendar containing one event every 15 minutes 9am-5pm 7 days per week only contains 11680 entries for a full year, which works out to around 8MB.

Would suggest instead of designing a schema at all, a calendar is a good example of a problem that might be far better implemented as a scan. Optimizing an iCalendar parser to traverse a range of dumped events at GB/sec-like throughputs would mean the above worst-case calendar could be scanned in single-digit milliseconds.

Since optimizing a parser is a much simpler problem to solve once than changing or adding to a bad data model after it has many users, and that the very first task involving your new data model is probably to write an iCalendar importer/exporter anyway, I think this would be a really great trade off.

因為每個 entry 也不到 1KB (740 bytes),就算是超級忙的 calendar,每週七天的 9am~5pm 都是每 15mins 一個會議的情況下,全滿也才 8MB... XDDDDDDDD

另外幫他補充說明,他這邊省略掉的是 calendar 接近於 share-nothing 架構,所以每個人直接拆開來獨立存放很容易 scale,你可以用個 scalable key-value solution 存 JSON 就好。

不過文章的原作者主要的目的是找一個大家熟悉但又有點複雜性的東西,示範 database schema 設計的考慮,我不會直接說文章原作者錯啦...

但的確一開始看到文章不會想到暴力解還蠻... 暴力的?XDDD

Meta 的 Llama 3.1

Meta 發佈了 Llama 3.1:「Introducing Llama 3.1: Our most capable models to date」,這本來就只是個發佈而已,但讓我注意到的是 AWSGCP 都同時宣佈在雲端上支援 Llama 3.1 了:

這代表 Meta 在 Llama 3.1 發表前就先跟 AWS & GCP 合作了,這看起來是一個包圍 OpenAI (以及微軟) 的姿態,之前好像沒看到這樣?(單純印象...)

Google Docs 支援 Markdown 的匯入與匯出

看到「Import and export Markdown in Google Docs」這個,掃了一下 Hacker News,上面已經有討論了:「Import and Export Markdown in Google Docs (googleblog.com)」。

id=40983113 這邊提到沒有 ``` (code block) 可以用,只有基本的 ` (inline code)。

不過 ``` 算是後來的延伸語法,拿原版的 Markdown 規格,也就是「Markdown: Syntax」這邊的定義來說,code block 是定義成 indent 4 spaces,但實際測試看起來也沒有支援...

這對於拿 Markdown 來寫技術文件的人來說缺了一大塊...

2012 年在 Google 時強制使用統一標準的 BUILD 檔案

Hacker News 上看到「Reformatting 100k Files at Google in 2011 (le-brun.eu)」這篇,原文在「The Story of Reformatting 100k Files at Google in 2012」這邊。

這篇除了作者寫的東西以外,Russ Cox 也在 Hacker News 上面分享了一些當時的背景。

故事是 2012 年的時候,作者 Laurent Le Brun (LinkedIn) 在德國,剛加入 Google 沒多久,負責處理 Google 內部的編譯工具 Blaze (外部稱 Bazel):

Back in September 2012, I was a junior engineer at Google, working on Bazel (Google’s build tool, also known internally as Blaze).

然後他與他的 team lead 收到美國團隊兩位工程師的會議邀請:

One day, a mysterious calendar invite landed in my inbox. It was sent by two engineers in the US, and I was invited along with my team lead.

然後這兩位是 Rob Pike 以及 Russ Cox:

I quickly recognized the names: Rob Pike and Russ Cox. Though I hadn't worked with them, I knew them by reputation: Russ Cox because I enjoyed reading his blog posts, and Rob Pike because, well… he’s famous.

作者這邊也懶得介紹 Rob Pike 了,畢竟是當年 Bell LabsPlan 9 的頭頭,然後又是 UTF-8 的發明人,後來在 Google 裡面也是 Go (程式語言) 的頭頭... 順便一提 Go 的吉祥物是他老婆畫的。

他們兩個人希望把全公司所有的 BUILD 檔案格式統一:

During the meeting, Rob and Russ shared their ambitious plan: to reformat every single Bazel BUILD file in Google’s codebase and enforce this formatting with a presubmit script.

中間有提到一些工程上的進行方式,但對我來說比較重要的是這幾個,首先是 Russ Cox 提到他在 Go 的團隊裡面學到的,(在大公司裡面) 不要讓工程師浪費時間在 formatting 之類的問題:

The formatting issue is a problem that should not exist and that we care enough about to spend our own engineering time eliminating it. I understand that it does not seem a priori that automated formatting would make a significant difference. I didn't truly understand it myself until we eliminated it from the Go code review and code editing processes. Hopefully in six months or a year, when everything is converted and we look back at this, the benefits will be clearer in retrospect.

另外在 update 的地方作者也提到了當時決定一次到位,把所有的 BUILD 檔案都修正完,而不是讓後續的人在更新時才強制要求,這樣可以避免後面的人接手的時候產生大量的 diff 而導致 review 困難:

Why not enable the new formatter without updating all the files? This would lead to long diffs whenever someone modifies a file, making the change hard to review. The cost would be spread over all the engineers in the company.

Google Public DNS 接受法國法院的阻擋要求

看到「Google, Cloudflare & Cisco Will Poison DNS to Stop Piracy Block Circumvention」這篇,法國在 2022 年通過的體育法律反過來干涉 ISP 或是服務提供商需要配合阻擋:

Tampering with public DNS is a step too far for many internet advocates but for major rightsholders, if the law can be shaped to allow it, that’s what will happen. In this case, Article L333-10 of the French Sports Code (active Jan 2022) seems capable of accommodating almost anything.

拿文章裡面提到的 footybite.cc 測試,實際在法國開一台 Vultr 的 VPS 測試各家 Public DNS 服務,看起來目前 Google Public DNS 已經實作了,而且傳回了 RFC 8914: Extended DNS Errors 內的 EDE 16:

$ dig footybite.cc @8.8.8.8

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; EDE: 16 (Censored): (The requested domain is on a court ordered copyright piracy blocklist for FR (ISO country code). To learn more about this specific removal, please visit https://lumendatabase.org/notices/41606068.)
;; QUESTION SECTION:
;footybite.cc.                  IN      A

目前拿 1.1.1.1 (Cloudflare)、9.9.9.9 (Quad9) 以及 208.67.222.222 (OpenDNS) 都還沒有看到被擋。

另外實際測試,自己架設 Unbound 看起來就可以繞過去了,不知道後續會不會要求更多,像是直接要求在 internet backbone 上面過濾 DNS?(當年推 DNS over TLSDNS over HTTPS 總算要派上用場了?)

另外就是看 Cloudflare 以及其他 Public DNS 服務有沒有反對的動作...

Google 停用了大量與中國與俄羅斯相關的帳號

在「Google Takes Down Influence Campaigns Tied to China, Indonesia, and Russia」這邊看到的,Google 的說明則是在「TAG Bulletin: Q2 2024」這邊,看起來像是例行性的更新?

與台灣有關的當然就是跟中國相關的影響,也是被停最多帳號的,在報告的最後提到 YouTubeBlogger 上面有掃到上千個與中國政府相關的宣傳帳號:

We terminated 1,320 YouTube channels and 1,177 Blogger blogs as part of our ongoing investigation into coordinated influence operations linked to the People’s Republic of China (PRC). The coordinated inauthentic network uploaded content in Chinese and English about China and U.S. foreign affairs. These findings are consistent with our previous reports.

第二多的則是俄羅斯:

We terminated 378 YouTube channels as part of our investigation into coordinated influence operations linked to Russia. The campaign was linked to a Russian consulting firm and was sharing content in Russian that was supportive of Russia and critical of Ukraine and the West.

其他的就比較零頭了...

Google 承認搜尋引擎內部 API 文件洩漏了?

前幾天很熱門的「An Anonymous Source Shared Thousands of Leaked Google Search API Documents with Me; Everyone in SEO Should See Them」消息,裡面提到有拿到一份疑似 Google 搜尋引擎的內部 API 文件, 可以證實有很多 Google 搜尋引擎的運作與 Google 對外宣稱的不符。

作者找了業內人士幫忙分析 (他自己說他先前也在碰 SEO 這塊,但是已經離開這個行業六年了):

Next, I needed help analyzing and deciphering the naming conventions and more technical aspects of the documentation. I’ve worked with APIs a bit, but it’s been 20 years since I wrote code and 6 years since I practiced SEO professionally.

包括作者的一些 Google 朋友,或是 ex-Googler 都確認這份文件符合 Google 內部的文件規範要求,另外裡面的元素編排也都很像是 Google 的文件。

本來以為事情大概就這樣,後續應該就是會有很多人從這份文件分析 Google 有哪些 SEO 的偏好,找出哪些東西與 Google 宣稱的不符。

不過事情突然有個意外的轉折,Google 本來一直是「拒絕評論」的態度,但突然承認這份文件的確是他們內部文件:「Google confirms the leaked Search documents are real」。

"We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information," Google spokesperson Davis Thompson told The Verge in an email. "We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation."

反正也沒有人會相信 outdated 了,但可以預想的是 Google 的搜尋結果應該又會變差,因為會有更多 SEO 垃圾開始想辦法衝排名上去...

用 udm=14 拿掉 Google Search 的一堆附加功能

在「&udm=14 | the disenshittification Konami code」這邊看到的,裡面有提到「How I Made Google’s “Web” View My Default Search」這篇說明。

我測了一下「how many people in the earth」這個搜尋條件,結果是這樣 (w/ uBlock Origin):

加上 &udm=14 後會拿掉很多「輔助」的區塊:

有些人可能偏好前者,但有些人偏好後者,可以自己選擇... (不過我應該會繼續用 Kagi)