Slack's Home-Grown Cron System

Saw "Executing Cron Scripts Reliably at Scale (slack.engineering)" on HN and realized it's an article from last September: "Executing Cron Scripts Reliably At Scale" (by the way, the readability of Slack's engineering blog has gotten a lot worse, but that's another story...).

Any organization that's big enough ends up building its own cron job system, because none of the off-the-shelf ones are any good XD

Slack's approach is to combine three internal systems:

  • A container-based system that manages the actual execution resources, built on Bedrock, where Bedrock itself is a system built on top of k8s.
  • A job queue subsystem, backed by Kafka + Redis.
  • A set of tables on Vitess, so the backend is MySQL.

A system like this is destined to be Slack-only, so it's mostly just interesting to see what they used...

That Article About Replacing RabbitMQ with PostgreSQL...

Saw "SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres Queue (prequel.co)" on Hacker News; the original article is "SQL Maxis: Why We Ditched RabbitMQ And Replaced It With A Postgres Queue", which goes through the whole story of how they replaced RabbitMQ with PostgreSQL.

There are quite a few points in the article worth picking on, and they were indeed called out on Hacker News. For example, someone pointed out that when they hit a bug (or feature), instead of fixing the bug they decided to rewrite everything on top of PostgreSQL, which is pretty odd:

In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure in that consumers were unable to grab messages, rightfully, when only one of the messages was manually ack'ed. Fixing this one fetch issue with their consumer would have fixed the entire problem. Switching to pg probably caused them to rewrite their message fetching code, which probably fixed the underlying issue.

Another point people picked on is the volume: at this kind of volume, using PostgreSQL to shrink the tech stack is probably a good decision (though that raises another question: why was RabbitMQ introduced in the first place...):

> To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If you your needs aren't that expensive, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k/jobs a day, thats roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.

In the same thread someone also pointed out that this volume is really small, even going so far as to fight dirty and suggest it could be handled with Jenkins XD:

Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell you banker, or investors. For tech people it’s per second, per minute, maybe per hour, or self aggrandizement.

A million requests a day sounds really impressive, but it’s 12req/s which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.

Then, unexpectedly, I saw Simon Willison bring up a key point: RabbitMQ still doesn't support ACID-level job queuing (especially the Durability part), i.e. the guarantee that a task the MQ system has acknowledged as received will definitely be processed:

The best thing about using PostgreSQL for a queue is that you can benefit from transactions: only queue a job if the related data is 100% guaranteed to have been written to the database, in such a way that it's not possible for the queue entry not to be written.

Brandur wrote a great piece about a related pattern here: https://brandur.org/job-drain

He recommends using a transactional "staging" queue in your database which is then written out to your actual queue by a separate process.

This is also why, back in the day, I used MySQL for this kind of thing: you want ACID properties to make sure nothing gets lost.
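
A minimal sketch of that staging-queue pattern, in Python with psycopg2 (my own illustration, not code from Brandur's post or the original article; the orders and staged_jobs tables are hypothetical): the job row is written in the same transaction as the business data, so the queue entry exists if and only if the data committed, and a separate drainer process later forwards staged jobs into the real queue.

    import psycopg2
    import psycopg2.extras

    def create_order_with_job(conn, total):
        # Both INSERTs happen in one transaction: either the order and its
        # "send_receipt" job are committed together, or neither is.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (total) VALUES (%s) RETURNING id",
                (total,),
            )
            order_id = cur.fetchone()[0]
            cur.execute(
                "INSERT INTO staged_jobs (kind, payload) VALUES (%s, %s)",
                ("send_receipt", psycopg2.extras.Json({"order_id": order_id})),
            )
        conn.commit()

    # A separate drainer process would poll staged_jobs and publish each row
    # to the actual broker (RabbitMQ, Kafka, ...), deleting the row only after
    # the broker confirms the publish.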

This is also the only case where I still think you need an RDBMS as the queue backend. But the thinking of the company in the original article is pretty baffling: after running into a library bug they decided to change the architecture instead of trying to fix the bug, and then happily wrote an article to promote it...

FBI Warns of a Growing Number of Cases Using Fake Identities and Deepfakes to Apply for Remote Jobs

Saw this rather "interesting" article on Hacker News: "FBI: Stolen PII and deepfakes used to apply for remote tech jobs (bleepingcomputer.com)". The original report is "FBI: Stolen PII and deepfakes used to apply for remote tech jobs", and the FBI announcement itself is "Deepfakes and Stolen PII Utilized to Apply for Remote Work Positions".

The FBI Internet Crime Complaint Center (IC3) warns of an increase in complaints reporting the use of deepfakes and stolen Personally Identifiable Information (PII) to apply for a variety of remote work and work-at-home positions.

As deepfake technology gets more and more mature, this problem will probably only get worse?

It also reminds me of an earlier case where someone found that the person who showed up for the job didn't seem to be the same person who did the interview, but I can't find that article at the moment...

Brendan Gregg Leaves Netflix

Brendan Gregg announced he is leaving Netflix: "Netflix End of Series 1". He also showed up on Hacker News to answer some questions: "Netflix End of Series 1 (brendangregg.com)".

Some of the questions were pretty interesting, like being asked about the size of his desk:

Off topic: I’m a bit surprised about Gregg’s desk (pre-pandemic). I imagine he’s getting a top level salary at Netflix but yet he’s got a small desk in what it looks to me a shared small office (or perhaps is that a mini open space office? Can’t tell).

Probably because there's a photo of it in the article, hence the question.

His answer:

A number of times people have asked about my desk over the years, and I'm curious as to why! I've visited other tech companies in the bay area, and the desks I see (including for 7-figure salary engineers) are the same as everyone else, in open office layouts. At Netflix it's been open office desks, and all engineers have the same desk.

Does some companies give bigger desks for certain staff, or offices, or is it a country thing (Europe?).

He hasn't said what his next job is yet:

I'll still be posting here in my next job. More on that soon...

Job Queue, Application Lock, and Pub/Sub with PostgreSQL

Saw an article on Hacker News Daily about using PostgreSQL for job queues, application locks, and pub/sub: "Do You Really Need Redis? How to Get Away with Just PostgreSQL"; the corresponding discussion can be found at "Do you really need Redis? How to get away with just PostgreSQL (atomicobject.com)".

Running these things on PostgreSQL is admittedly a bit wasteful, but for your own projects, if you don't want to make the infrastructure too complicated, it's actually not a bad choice.

First, the job queue part. From the examples it looks like they're building an async job queue (no need to wait for a return value), which reminds me of the queue services I wrote long ago (I think once in 2007 and once in 2012), except I used MySQL as the backend and had to find ways to work around InnoDB's locking behavior.
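
On PostgreSQL the usual building block for this is SELECT ... FOR UPDATE SKIP LOCKED, which lets multiple workers poll the same table without blocking each other. A rough sketch of a worker loop iteration (my own sketch, not the article's code; the jobs table and its columns are made up):

    import psycopg2

    # Hypothetical schema:
    #   CREATE TABLE jobs (
    #       id      bigserial PRIMARY KEY,
    #       payload jsonb NOT NULL,
    #       status  text NOT NULL DEFAULT 'pending'
    #   );

    def work_one(conn, handler):
        # Claim one pending job; SKIP LOCKED means concurrent workers simply
        # skip rows someone else has locked instead of blocking on them.
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, payload FROM jobs
                WHERE status = 'pending'
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
                """
            )
            row = cur.fetchone()
            if row is None:
                conn.rollback()
                return False
            job_id, payload = row
            # If the handler raises, the UPDATE and commit below never run,
            # so once the transaction is rolled back the row stays 'pending'.
            handler(payload)
            cur.execute("UPDATE jobs SET status = 'done' WHERE id = %s", (job_id,))
        conn.commit()
        return True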

Designing an async job queue actually involves a lot of odd gotchas, mostly around how to handle failure states. Most requirements fall into two categories. The most common is at-least-once, guaranteeing the job runs at least once; anything that's designed to be idempotent can go here, like reporting jobs (re-running yesterday's report is fine), and daily member-status updates also fit here.

The less common ones are at-most-once and exactly-once, running at most once or exactly once, usually for operations that aren't idempotent, like charging a payment. The mechanism here usually ends up entangled with business logic, and either way it's not easy to handle...
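
The difference mostly comes down to when you commit the "claimed" state relative to doing the work. Continuing the made-up jobs table above, an at-most-once worker might look like this (again just my own sketch): it commits the claim before running the handler, so a crash mid-handler loses the job but can never run it twice, whereas the at-least-once version above commits after the work and simply retries on a crash.

    def work_one_at_most_once(conn, handler):
        # Mark the job as taken and COMMIT *before* doing the work.
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE jobs SET status = 'taken'
                WHERE id = (
                    SELECT id FROM jobs
                    WHERE status = 'pending'
                    ORDER BY id
                    FOR UPDATE SKIP LOCKED
                    LIMIT 1
                )
                RETURNING id, payload
                """
            )
            row = cur.fetchone()
        conn.commit()
        if row is None:
            return False
        handler(row[1])  # a crash here means the job is lost, not re-run
        return True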

The second is the application lock, a lock mechanism that works across machines. When the volume isn't big, running it on PostgreSQL is fine; beyond that you need something else. ZooKeeper comes to mind immediately, but systems designed in recent years probably lean more toward etcd or Consul...
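 
For the cross-machine lock, PostgreSQL's advisory locks are the usual building block; a small sketch (my own, with an arbitrary 64-bit lock key):

    import psycopg2

    def run_exclusively(conn, lock_key, fn):
        # pg_try_advisory_lock() returns immediately with true/false; the lock
        # is held by this session until released or the connection closes.
        with conn.cursor() as cur:
            cur.execute("SELECT pg_try_advisory_lock(%s)", (lock_key,))
            if not cur.fetchone()[0]:
                return False  # someone else (possibly another machine) holds it
            try:
                fn()
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (lock_key,))
            return True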

Finally, pub/sub: likewise, when the volume isn't too big, running it on PostgreSQL is fine; once it grows you need to bring in something like Kafka, software designed specifically for this kind of performance...
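
Pub/sub on PostgreSQL usually means LISTEN/NOTIFY; a minimal subscriber sketch with psycopg2 (the channel name "events" and the DSN are made up). Worth noting that NOTIFY payloads aren't persisted, so a subscriber that's offline simply misses them:

    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=app")  # hypothetical DSN
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    cur = conn.cursor()
    cur.execute("LISTEN events;")

    # Another connection publishes with: NOTIFY events, 'some payload';
    while True:
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # 5s timeout with nothing to read, poll again
        conn.poll()
        while conn.notifies:
            n = conn.notifies.pop(0)
            print("channel:", n.channel, "payload:", n.payload)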

The Risk and Value of Startup Stock Options

Came across the site TLDR Stock Options; I have the impression it has been around for a while (I think I've seen it before but never wrote it down?), not sure whether the data has been updated since...

The site drastically simplifies the input (you only need to enter your ownership percentage and which round the company is currently at), and then gives you two numbers: the probability you can expect the company to successfully exit, and the amount of money you can expect to receive.

These are heavily simplified numbers (no breakdown by industry or region), but if you have no concept of stock options at all, they can give you a rough feel.

Silicon Valley Tech Company Compensation in 2018

Not too surprisingly, tech companies in the California area still rank highest for compensation (this counts all income: salary, stock, and bonuses): "Top Paying Tech Companies of 2018".

The pre-compiled top five are split into three categories: "Entry-level / 1+ Yrs of Experience", "Mid-level / 3+ Yrs of Experience", and "Been Around the Block / 5+ Yrs of Experience". You can see that compensation climbs quickly as years of experience increase...

Note that tied companies don't take up multiple spots here, only one, which is different from how rankings are usually done, so even though it's a "top five", each list actually has six companies.

Winning Programming Competitions Correlates Negatively with Job Performance

An article and talk from 2015 that recently resurfaced. Google's Peter Norvig mentioned that they analyzed this with ML and found a negative correlation between programming competition results and job performance: "Being good at programming competitions correlates negatively with being good on the job".

In other words, programming competition results are actually a negative indicator (this was derived from data inside Google, so the sample is already filtered through Google's hiring process).

In this talk, Peter talked about how Google did machine learning and at one point he mentioned that at Google they also applied machine learning to hiring. He said that one thing that was surprising to him was that being a winner at programming contests was a negative factor for performing well on the job.

He offered some guesses as to why:

Peter added that programming contest winners are used to cranking solutions out fast and that you performed better at the job if you were more reflective and went slowly and made sure things were right.

There are also some guesses in the YouTube comments, like:

What he's talking about is the fact that several extremely important parts of software engineering are not included in these contests, for example code reusability, maintainability, decomposition of the problem using the OO paradigm, etc. All of these make a good engineer, but are not necessarily needed in competitive programming contests.

A Discussion on "Automating Your Job Away"

A post that has been pretty hot on The Workplace Stack Exchange recently: "Is it unethical for me to not tell my employer I've automated my job?".

The author's full-time job is pulling data out of a system and pasting it into a spreadsheet (maybe Excel?). The pay is pretty decent, and after writing a program to automate it, the author found they only need to work one or two hours a week:

There might be amendments to the spec and corresponding though email etc, but overall, I spend probably 1-2 hours per week on my job for which I am getting a full time wage.

Now they're agonizing over whether to tell their employer, so they went and posted about it XDDD If you're interested, go take a look at the responses under the post...

Amazon ECS Can Now Run Cron Jobs...

Previously, running something on Amazon ECS on a fixed schedule meant driving it yourself with AWS Lambda (or hosting your own scheduler, but then you'd have to worry about a High Availability setup); now it's supported directly: "Amazon ECS Now Supports Time and Event-Based Task Scheduling".

Previously, you could start and stop Amazon ECS tasks manually, but running tasks on a schedule required writing and integrating an external scheduler with the Amazon ECS API.

Now you can schedule tasks through the Amazon ECS console on fixed time intervals (e.g.: number of minutes, hours, or days). Additionally, you can now set Amazon ECS as a CloudWatch Events target, allowing you to launch tasks by using CloudWatch Events.
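
Outside the console, the same thing can be wired up through the CloudWatch Events API; a rough boto3 sketch (all the ARNs, names, and the schedule here are placeholders I made up):

    import boto3

    events = boto3.client("events")

    # A rule that fires on a fixed interval.
    events.put_rule(
        Name="my-scheduled-task",
        ScheduleExpression="rate(15 minutes)",
        State="ENABLED",
    )

    # Point the rule at an ECS task definition on a cluster.
    events.put_targets(
        Rule="my-scheduled-task",
        Targets=[
            {
                "Id": "my-scheduled-task-target",
                "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",
                "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
                "EcsParameters": {
                    "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/my-task:1",
                    "TaskCount": 1,
                },
            }
        ],
    )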