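Ruff: a Python linter written in Rust

Saw "Astral (astral.sh)" on Hacker News Daily; the site itself is "Astral: Next-gen Python tooling".

The Ruff project mentioned there is a Python linter written in Rust whose main selling point is speed, and the benchmark on the official site makes the gap obvious.

Since it is part of the Python ecosystem, you can install the prebuilt package straight from pip rather than compiling it yourself with cargo (though you can still build it with cargo if you want to).

I tried it on feedgen and it really is fast, which makes linting a lot less of a chore... Note that it creates a .ruff_cache/ directory, so add that to .gitignore.

I started with the default settings, fixed the unused imports it reported, and will look at the rest when I get the chance.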
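As a minimal sketch of what an unused-import finding looks like: Ruff's default rules include Pyflakes' F401 (unused import). The module below is hypothetical and is not taken from feedgen; it only shows the shape of code that triggers the rule.

```python
import json
import os  # never referenced below: this is the kind of line F401 (unused import) reports


def to_json(item: dict) -> str:
    """Serialize a feed item; only `json` is actually used."""
    return json.dumps(item, sort_keys=True)


print(to_json({"title": "hello", "id": 1}))
```

Running `ruff check` on a file like this should flag the `os` line, and deleting the import leaves the behavior unchanged.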

Node.js 20

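Node.js 20 is out; the official announcement is "Node.js 20 is now available!".

The Permission Model mentioned there looks a bit problematic by design? Something like this ought to be an allowlist, but what is implemented right now looks like a denylist...

The end of the announcement notes that 14 reaches end-of-life this month, and that 16 is being retired early, this September, to line up with the OpenSSL 1.1.1 EoL (see the "OpenSSL 1.1.1 End of Life" post that OpenSSL published a while ago):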

Also of note is that Node.js 14 will go End-of-Life in April 2023, so we advise you to start planning to upgrade to Node.js 18 (LTS) or Node.js 20 (soon to be LTS).

Please, consider that Node.js 16 (LTS) will go End-of-Life in September 2023, which was brought forward from April 2024 to coincide with the end of support of OpenSSL 1.1.1.

I looked it up: 18 will run until the end of April 2025, and 20 until the end of April 2026...

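Web LLM running on WebGPU

Something I saw via Simon Willison: a demo that runs an LLM directly in the browser through WebGPU, "Web LLM runs the vicuna-7b Large Language Model entirely in your browser, and it’s very impressive". The project is at "Web LLM", where you can try it right away.

Watch out for browser support, though: Chrome needs 113+, and stable is currently still 112. For Firefox, I tried turning on WebGPU support with dom.webgpu.enabled in about:config, but it still would not run after restarting the browser? (It may also be a Linux issue.)

Update: it is probably the Linux environment; the dev channel (114) does not work for me on Linux either.

Come to think of it, now that WebGPU is here, do we need to start blocking GPU mining...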

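Undoing booking.com's dark patterns

Saw an extension on Hacker News Daily that rewrites booking.com's dark patterns: "De-Stressing Booking.com (alexcharlton.co)", which links to the original "De-stressing Booking.com". It is a 2019 article introducing the extension he wrote.

It deals with things like this:

This example deliberately uses a stressful color (red here) to push users into booking quickly, which is a fairly classic dark pattern. The author gives a similar example from Airbnb, which comes off much better by comparison:

Commenters also bring up other kinds of dark patterns, such as deliberately marking some hotels as sold out to create the illusion that you will miss out unless you book right away. Someone further down points out that in countries with more mature legal systems this gets into false advertising and the like: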

After browsing hotels for some time I've seen booking.com show several hotels start to sell out of rooms. That usually causes me to hurry up and book, but after several hotels showed full at once I got suspicious and checked my partners phone. The hotels still showed as available there. Dark stuff. Their website is otherwise pretty good though and I still use them.

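"Online hotel booking" covers the UK's investigation of, and enforcement action against, these online booking sites.

Then in the Hacker News comments I saw an interesting approach from PresidentObama (that id, XDDD), which is to block the elements with uBlock Origin: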

From the last time booking.com was discussed I picked up some ublock origin filters that make the website more bearable.

You can copy and paste them directly in your ublock config (ublock options -> My filters)

  ! https://news.ycombinator.com/item?id=21860328
  booking.com##.soldout_property
  booking.com##.sr_rooms_left_wrap.only_x_left
  booking.com##.lastbooking
  booking.com##.sr--x-times-booked
  booking.com##.in-high-demand-not-scarce
  booking.com##.top_scarcity
  booking.com##.hp-rt-just-booked
  booking.com##.cheapest_banner_content > *
  booking.com##.hp-social_proof
  booking.com##.fe_banner__red.fe_banner__w-icon.fe_banner__scale_small.fe_banner
  booking.com##.urgency_message_x_people.urgency_message_red
  booking.com##.rackrate
  booking.com##.urgency_message_red.altHotels_most_recent_booking
  booking.com##.fe_banner__w-icon-large.fe_banner__w-icon.fe_banner
  booking.com##.smaller-low-av-msg_wrapper
  booking.com##.small_warning.wxp-sr-banner.js-wxp-sr-banner
  booking.com##.lock-price-banner--no-button.lock-price-banner.bui-u-bleed\@small.bui-alert--large.bui-alert--success.bui-alert

There are also filters that strip some tracking URL parameters:

Apart from these, I use some additional ublock filters to block some of their tracking that I am not ok with.

  $removeparam=/^(error_url|ac_suggestion_theme_list_length|ac_suggestion_list_length|search_pageview_id|ac_click_type|ac_langcode|ac_position|ss_raw|from_sf|is_ski_area|src|sb_lp|sb|search_selected|srpvid|click_from_logo|ss|ssne|ssne_untouched|b_h4u_keep_filters|aid|label|all_sr_blocks|highlighted_blocks|ucfs|arphpl|hpos|hapos|matching_block_id|from|tpi_r|sr_order|srepoch|sr_pri_blocks|atlas_src|place_types)/,domain=booking.com
  $removeparam=/sid=.*;BBOX/,domain=booking.com
  ||www.booking.com/c360/v1/track
  ||www.booking.com/fl/exposed
  ||booking.com/personalisationinfra/track_behaviour_property
  ||booking.com/has_seen_review_list

That said, I hardly seem to use booking.com anymore...

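LLMs currently available for commercial use

A discussion I saw on Ask Hacker News Weekly, where someone asked which LLMs can currently be used commercially: "Ask HN: Open source LLM for commercial use?".

Someone suggested that Google's Flan is probably the strongest option right now? It can be downloaded from Hugging Face: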

I've seen this question asked repeatedly in many LLaMa threads, currently the best models that are truly open are the released models from the Flan family by Google, which includes Flan-T5[0] and Flan-UL2[1]. According to its paper, Flan-UL2 performs slightly better than Flan-T5-XXL.

It is roughly GPT-3 level, still some distance from GPT-3.5 or the ChatGPT derived from it. But tuned for a specific scenario it should still be usable:

These models perform slightly better than GPT-3 under some tasks[2], but they're still far from achieving the results from GPT-3.5 and GPT-4. This becomes evident when you try to use them in the real world; they're not "good enough" for general use cases, unlike ChatGPT models. However, if you can restrict your use case to one particular domain, you can achieve pretty good results by further fine-tuning these models.

Another reply mentions some other models:

The ones I saw mentioned so far were Flan, Cerebras, GPT-J, and RWKV.

Not yet mentioned:

* Pythia https://github.com/EleutherAI/pythia

* GLM-130B https://github.com/THUDM/GLM-130B - see also ChatGLM-6B https://github.com/THUDM/ChatGLM-6B

* GPT-NeoX-20B https://huggingface.co/EleutherAI/gpt-neox-20b

* GeoV-9B https://github.com/geov-ai/geov

* BLOOM https://huggingface.co/bigscience/bloom and BLOOMZ https://huggingface.co/bigscience/bloomz

Looks like this list is a good place to dig if you need one...

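What's new in SQL:2023

Saw "SQL: 2023 is finished: Here is what's new (eisentraut.org)" on Hacker News, which covers what is new in the SQL:2023 standard; the original article is "SQL:2023 is finished: Here is what's new".

"UNIQUE null treatment (F292)" lets you decide whether NULLs count as unique, which happens to be the same thing covered in my earlier post on PostgreSQL 15 being able to constrain the uniqueness of NULLs through UNIQUE.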
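
As a small illustration of why the choice matters, the sketch below shows the usual default, where NULLs are treated as distinct and a UNIQUE column therefore accepts several of them. It uses SQLite only because it ships with Python, and it exercises only that default; the `NULLS NOT DISTINCT` variant from the standard and PostgreSQL 15 is not used here. The table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

# Two NULLs are accepted: under the default treatment NULLs count as distinct.
conn.execute("INSERT INTO users (email) VALUES (NULL)")
conn.execute("INSERT INTO users (email) VALUES (NULL)")

# A repeated non-NULL value is rejected by the same constraint.
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
try:
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_rows = conn.execute(
    "SELECT count(*) FROM users WHERE email IS NULL"
).fetchone()[0]
print(null_rows, duplicate_rejected)
```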

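"ORDER BY in grouped table (F868)" allows ORDER BY on columns that do not appear in SELECT as well. From the description, the restriction mainly applied when a JOIN was involved. The obvious workaround is to add the column to SELECT, but that increases the amount of data returned.

"GREATEST and LEAST (T054)": the names MIN() and MAX() were already taken by the aggregate functions, so these had to be called something else.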
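
The difference between the two kinds of function can be sketched as follows. SQLite is used here only because it ships with Python, and its multi-argument `max()` stands in for GREATEST; the table and its values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 5), (7, 2), (3, 3)])

# Row-wise: compares the columns within each row (what GREATEST does).
row_wise = [r[0] for r in conn.execute("SELECT max(a, b) FROM t ORDER BY rowid")]

# Aggregate: collapses one column across all rows (what MAX() already meant).
column_max = conn.execute("SELECT max(a) FROM t").fetchone()[0]

print(row_wise, column_max)
```

The row-wise form returns one value per row, while the aggregate returns a single value for the whole table, which is why the standard needed two separate names.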

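"String padding functions (T055)" and "Multi-character TRIM functions (T056)" are familiar syntax; every vendor already has equivalent functions, but now they are in the standard.

"Optional string types maximum length (T081)" means VARCHAR no longer needs a declared size. Probably not a big deal in practice?

"Enhanced cycle mark values (T133)": the recursive queries mentioned here are something I forget every time I use them, and I did not really follow the cycle feature...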
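
As a reminder of what a recursive query over cyclic data looks like, here is a minimal sketch. It does not use the standard's CYCLE clause; it relies on `UNION` (rather than `UNION ALL`) dropping rows that were already produced, which is a simpler way to make the recursion stop. SQLite is used only because it ships with Python, and the `edges` table is a made-up three-node loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
# 1 -> 2 -> 3 -> 1 forms a cycle.
conn.executemany("INSERT INTO edges VALUES (?, ?)", [(1, 2), (2, 3), (3, 1)])

rows = conn.execute(
    """
    WITH RECURSIVE reach(node) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edges AS e JOIN reach AS r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
    """
).fetchall()
reachable = [r[0] for r in rows]
print(reachable)
```

With `UNION ALL` the same query would keep walking the loop, which is the situation cycle detection is meant to handle.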

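"ANY_VALUE (T626)" looks like it pulls an arbitrary value out of a group (the standard does not promise that it is random), so if any row will do, combining it with GROUP BY '' might avoid something ugly like ORDER BY RAND()?

"Non-decimal integer literals (T661)" and "Underscores in numeric literals (T662)" both make numbers easier to read and work with.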
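
Python already has the same two conveniences, so as an analogy (this is Python syntax, not SQL) the literals look like this:

```python
mask = 0xFF             # hexadecimal
flags = 0b1010          # binary
mode = 0o755            # octal
population = 1_000_000  # underscores as digit separators

print(mask, flags, mode, population)
```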

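The rest covers a lot of JSON functionality; it looks like SQL:2016 brought in some of it first and SQL:2023 fills it out more completely.

A graph-related standard is also defined as part of SQL:2023. The article does not say much about it; it looks like SQL is branching out into that area?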

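MySQL 5.7 support ends this October (Oct 2023)

While going through some material I just noticed that Oracle's support for MySQL 5.7 has only half a year left and is scheduled to end in October 2023: "Oracle Technology Products - Oracle Lifetime Support Policy".

Next door, Percona's build, Percona Server for MySQL 5.7, can be checked on "Percona Release Lifecycle Overview". It appears to have the same date (October 2023), but I am not sure whether they will announce an extension, at least for things like security fixes.

I had not been paying attention, and suddenly there is only half a year left...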

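FerretDB, a MongoDB alternative, releases version 1.0 (GA)

Saw on Hacker News that FerretDB has released 1.0 (GA): "FerretDB: open-source MongoDB alternative (ferretdb.io)"; the original post is "Announcing FerretDB 1.0 GA - a truly Open Source MongoDB alternative".

I wrote a post back then about MangoDB being renamed to FerretDB (the ferret), but had not noticed that they set up a company to develop it? Hiring information is on "Careers at FerretDB".

The official site lays out their goals. For example, they say the aim is not a drop-in replacement but to implement the core features and the commonly used ones, covering the majority of users: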

Is FerretDB 100% compatible with MongoDB?

It is not necessary, nor it is feasible to implement every single MongoDB feature out there. Our aim is to cover the core feature set of MongoDB, and then continue adding features which could enhance the experience or increase application compatibility. Non-OSS alernatives of MongoDB are similar in this sense, eg. none of these products are able to provide the full feature set of MongoDB. We are aiming to please 85% of MongoDB users, not all of them.

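But that will also give people who want to switch some reservations... and it is unclear where this 85% figure comes from?

That article about replacing RabbitMQ with PostgreSQL...

Saw "SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres Queue (prequel.co)" on Hacker News; the original is "SQL Maxis: Why We Ditched RabbitMQ And Replaced It With A Postgres Queue", which explains how and why they replaced RabbitMQ with PostgreSQL.

There is actually quite a lot to pick at in the article, and Hacker News commenters called it out too. For example, someone pointed out that they ran into a bug (or a feature) and, instead of fixing it, decided to rewrite everything on PostgreSQL, which is rather odd: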

In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure in that consumers were unable to grab messages, rightfully, when only one of the messages was manually ack'ed. Fixing this one fetch issue with their consumer would have fixed the entire problem. Switching to pg probably caused them to rewrite their message fetching code, which probably fixed the underlying issue.

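Another thing to pick at is the volume. At this volume, using PostgreSQL to shrink the tech stack is probably a good decision (but that raises another question: why was RabbitMQ brought in to begin with...):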

>To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If you your needs aren't that expensive, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k/jobs a day, thats roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.

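Further down the same thread, others also note that this volume is really tiny, with one going as far as to say, no holds barred, that Jenkins could handle it XD: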

Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell you banker, or investors. For tech people it’s per second, per minute, maybe per hour, or self aggrandizement.

A million requests a day sounds really impressive, but it’s 12req/s which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.
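
The per-day figures quoted above are easy to check. The sketch below just redoes the division; the helper names are made up.

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60


def per_minute(per_day: float) -> float:
    """Convert a per-day count into an average per-minute rate."""
    return per_day / MINUTES_PER_DAY


def per_second(per_day: float) -> float:
    """Convert a per-day count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


print(f"{per_minute(10_000):.1f} jobs/minute")         # 10k jobs a day
print(f"{per_second(1_000_000):.1f} requests/second")  # a million requests a day
```

These are averages, so real traffic will have peaks above them, but they confirm the order of magnitude in the comments: roughly 7 jobs per minute and roughly 12 requests per second.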

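Then, unexpectedly, I saw Simon Willison raise a key point: RabbitMQ still does not support ACID-level job queuing (especially the Durability part), that is, the guarantee that a task the MQ system reports as successfully received will definitely be processed: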

The best thing about using PostgreSQL for a queue is that you can benefit from transactions: only queue a job if the related data is 100% guaranteed to have been written to the database, in such a way that it's not possible for the queue entry not to be written.

Brandur wrote a great piece about a related pattern here: https://brandur.org/job-drain

He recommends using a transactional "staging" queue in your database which is then written out to your actual queue by a separate process.

That is also why I used MySQL for this sort of thing back in the day: the ACID properties were needed to make sure nothing got lost.
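
The transactional enqueue idea from the quote can be sketched like this. It is only an illustration under stated assumptions: SQLite stands in for PostgreSQL or MySQL because it ships with Python, and the `orders` and `jobs` tables, the `place_order` helper and the simulated failure are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, order_id INTEGER NOT NULL, done INTEGER DEFAULT 0)"
)


def place_order(item: str, fail_before_enqueue: bool = False) -> None:
    """Write the order and its job in one transaction (hypothetical helper)."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        if fail_before_enqueue:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute("INSERT INTO jobs (order_id) VALUES (?)", (cur.lastrowid,))


place_order("book")
try:
    place_order("pen", fail_before_enqueue=True)
except RuntimeError:
    pass  # the half-written order was rolled back together with the missing job

order_count = conn.execute("SELECT count(*) FROM orders").fetchone()[0]
job_count = conn.execute("SELECT count(*) FROM jobs").fetchone()[0]
print(order_count, job_count)
```

Either both rows exist or neither does, so there is never an order without its job; that is the property that is hard to get when the data lives in a database and the queue lives in a separate system.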

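This is also the only place where I currently think an RDBMS is still needed as a queue backend. The thinking at the company in the article is baffling, though: after hitting a library bug they decided to change the architecture instead of working out a fix, and then happily wrote an article to publicize it...

Hooking up services with RSS-Bridge

While looking something up I came across RSS-Bridge, a project written in PHP. Find any PHP hosting, set it up, and it works; there are no other real requirements.

I set it up quickly and ran it through some tests, and it looks pretty good. For ordinary users this project alone should be enough. Light users can try the Public instances listed by the project, and heavier users can host it themselves. PHP hosting is easy to find; the project requires 7.4+, so as long as you check the PHP version your host provides there should not be many problems.

My own feedgen was more of an excuse to practice Python, though at the time I really did not know a project like this existed. Looking at the tag history on GitHub, the project has been around since 2013...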