
openrsync

In "openrsync imported into the tree" I saw that the openrsync project has been imported into the OpenBSD source tree.

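rsync is licensed under GPLv3. When that license first came out, several of the larger BSD teams had lawyers study it, and they all ended up recommending against putting GPLv3 software into the source tree. But rsync is a very useful tool (especially in terms of efficiency).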

It looks like the main goal of the openrsync project is to reimplement rsync under the ISC license:

This is an implementation of rsync with a BSD (ISC) license. It's compatible with a modern rsync (3.1.3 is used for testing, but any supporting protocol 27 will do), but accepts only a subset of rsync's command-line arguments.

For now it is only designed to run on OpenBSD; other platforms will probably need some porting work to fix compatibility:

At this time, openrsync runs only on OpenBSD. If you want to port to your system (e.g. Linux, FreeBSD), read the Portability section first.

The Git repository on GitHub is only a mirror; the code itself is still managed in CVS:

This repository is a read-only mirror of a private CVS repository. I use it for issues and pull requests. Please do not make feature requests: I will simply close out the issue.

Tools for migrating from Microsoft SQL Server to PostgreSQL

In "How to Migrate from Microsoft SQL Server to PostgreSQL", the author's client needed to migrate from Microsoft SQL Server to PostgreSQL (the reason is not mentioned).

The migration has two main stages. The first is converting the schema, for which the author mentions dalibo/sqlserver2pgsql, a tool written in Perl:

Migration tool to convert a Microsoft SQL Server Database into a PostgreSQL database, as automatically as possible http://dalibo.github.io/sqlserver2pgsql

The second stage is converting the data, for which the author chose the Community Edition of Pentaho Data Integration:

Pentaho offers various stable data-centric products. Pentaho Data Integration (PDI) is an ETL tool which provides great support for migrating data between different databases without manual intervention. The community edition of PDI is good enough to perform our task here. It needs to establish a connection to both the source and destination databases. Then it will do the rest of work on migrating data from SQL server to Postgres database by executing a PDI job.

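So the migration chains two tools together... The article also doesn't mention issues such as stored procedures; perhaps their client doesn't use them, or uses them only rarely?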

PostgreSQL's headaches over fsync() behavior...

A talk at FOSDEM 2019 discusses how PostgreSQL, while guaranteeing the Durability part of ACID, found that fsync() does not behave the way it had assumed (mainly in what happens when fsync() fails): "PostgreSQL vs. fsync".

The Q&A-style interview in "PostgreSQL vs. fsync. How is it possible that PostgreSQL used fsync incorrectly for 20 years, and what we'll do about it." gives a quick description of the short-term plan and the long-term thinking:

The short-term solution is ensuring that we detect fsync errors reliably at least on sufficiently recent kernels (since 4.13). On older kernels we can’t do much better, unfortunately.

The long-term solution is still being discussed in the community, but it’s hard to say how we could keep relying on buffered I/O in the future. So we may end up with direct I/O, but that’s a pretty significant change and is likely going to be a multi-year project.

The MySQL world, on the other hand, mostly uses O_DIRECT, so it is affected much less...
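As a rough illustration of what "detecting fsync errors reliably" means for code that relies on buffered I/O, here is a minimal Node.js sketch. It is not PostgreSQL's code: `durableWrite` and its injected `fs` parameter are hypothetical names chosen for this example. It encodes one assumption from the discussion around the talk, namely that on Linux a failed fsync() may leave the affected pages marked clean, so a later fsync() that succeeds does not prove the data reached disk. The sketch therefore calls fsync() once and treats a failure as fatal for that write.

```javascript
// Hypothetical helper: write `data` to `path` and return normally only after
// fsync() has succeeded. `fs` is passed in by the caller so the sketch itself
// needs no imports.
function durableWrite(fs, path, data) {
  const fd = fs.openSync(path, 'w');
  try {
    fs.writeSync(fd, data);
    // Assumption: a failed fsync must not be retried, because a later
    // "successful" fsync may no longer cover the pages that failed.
    fs.fsyncSync(fd);
  } catch (err) {
    throw new Error(`write to ${path} is not durable: ${err.message}`);
  } finally {
    // Runs on both the success and the failure path.
    fs.closeSync(fd);
  }
}
```

With the real file system this could be called as `durableWrite(require('node:fs'), 'data.bin', 'payload')`; what the caller does after the error (abort, replay from a log) is outside the sketch.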

Mercury Web Parser goes open source

I saw "Mercury Goes Open Source!": the team at Postlight has open-sourced Mercury Web Parser, and the code is available on GitHub at postlight/mercury-parser.

This version is written in Node.js; the example shows both the usage and the result:

import Mercury from '@postlight/mercury-parser';
// `url` is the address of the page to parse; the promise resolves to an object like the one below.
Mercury.parse(url).then(result => console.log(result));
{
  "title": "Thunder (mascot)",
  "content": "<div><div><p>This is the content of the page!</div></div>",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

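For earlier software and services in this area, see the survey and comparison in "Evaluating Text Extraction Algorithms", although even the original site for that article is gone... It can only be dug out of the Internet Archive.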

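Quite a few teams have worked on this problem (given an HTML page, extract the block of actual content), but quite a few of them have also died off... The one I remember best is Readability, which shut down in 2016: "The Readability bookmarking service will shut down on September 30, 2016."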

Handy when you need to scrape data...

How Braintree (PayPal) uses PostgreSQL

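The hardest problems with an RDBMS all revolve around "how to avoid interrupting service" (many things are easy when you don't have to think about uptime/downtime, from ALTER and failover to backup and restore plans). In "PostgreSQL at Scale: Database Schema Changes Without Downtime", PayPal's Braintree discusses how to change a PostgreSQL database schema without interrupting service.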

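Most of the article covers details aimed at DBAs (e.g. how to avoid triggering large-scale locks that cause an outage) rather than developer-facing matters... But the opening part, which I think is the most important, is where developers need to be involved: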

For all code and database changes, we require that:

  • Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
  • New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.

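To satisfy these two requirements, a schema change may have to be carried out in several stages rather than all at once. That is also what lets you avoid having to take the site down and restore data from backup...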

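It's worth studying how to do this: once you know how to design the steps for the common scenarios, you'll be more practiced when you actually run into one.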
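One common way to split a change into stages is often called "expand and contract". The sketch below is an illustration under stated assumptions, not something taken from the Braintree article: it models renaming a hypothetical column `users.name` to `users.full_name` as alternating schema and code steps, and `isSafePlan` (also a hypothetical name) checks one simplified invariant, that the live code always finds the columns it uses, before and after every step.

```javascript
// Hypothetical staged plan for renaming users.name to users.full_name.
const renamePlan = [
  { kind: 'schema', sql: 'ALTER TABLE users ADD COLUMN full_name text', add: ['full_name'] },
  { kind: 'code', note: 'write both columns, still read name', needs: ['name', 'full_name'] },
  { kind: 'schema', sql: 'UPDATE users SET full_name = name WHERE full_name IS NULL' },
  { kind: 'code', note: 'read and write only full_name', needs: ['full_name'] },
  { kind: 'schema', sql: 'ALTER TABLE users DROP COLUMN name', drop: ['name'] },
];

// Returns true when no step leaves running code without a column it needs.
function isSafePlan(initialColumns, initialNeeds, steps) {
  let columns = new Set(initialColumns);
  let needs = initialNeeds;
  const satisfied = (cols, required) => required.every((c) => cols.has(c));

  for (const step of steps) {
    if (step.kind === 'schema') {
      const next = new Set(columns);
      (step.add || []).forEach((c) => next.add(c));
      (step.drop || []).forEach((c) => next.delete(c));
      // The code that is live right now must keep working on the new schema.
      if (!satisfied(next, needs)) return false;
      columns = next;
    } else {
      // The new code must work on the current schema; the old code already
      // does, so the deploy can be gradual and can be rolled back.
      if (!satisfied(columns, step.needs)) return false;
      needs = step.needs;
    }
  }
  return true;
}

console.log(isSafePlan(['id', 'name'], ['name'], renamePlan));
```

A one-shot rename (a single schema step that adds `full_name` and drops `name`) fails the same check, because the code still reading `name` would break the moment the column disappears.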

A userscript for blocking Facebook ads

To counter the various ad blockers, Facebook uses all sorts of odd DOM structures to get in their way.

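At this point it looks like the filter formats supported by ad blockers such as uBlock Origin can no longer block these ads, so other tools are needed... "Facebook unsponsored", which I have been using and which is still being updated, works reasonably well... Its source code shows that it analyzes the text that is actually displayed, so it isn't thrown off by the DOM tricks, though lately it looks like it is being targeted again... XD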
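The idea of matching displayed text rather than DOM structure can be sketched as follows. This is not the code of "Facebook unsponsored": the fragment format (`{ text, visible }`), the function names, and the `Sponsored` label are all assumptions made for the example. In a real userscript the fragments would come from walking a post's nodes and checking whether each one is actually rendered.

```javascript
// Hypothetical model: an obfuscated label is split into small fragments,
// some of which are hidden decoys that never appear on screen.
function visibleText(fragments) {
  return fragments
    .filter((f) => f.visible)
    .map((f) => f.text)
    .join('');
}

// A post counts as an ad when the text a user would see contains the label.
function looksSponsored(fragments, label = 'Sponsored') {
  return visibleText(fragments).includes(label);
}

const post = [
  { text: 'Sp', visible: true },
  { text: 'x', visible: false }, // decoy character hidden via CSS
  { text: 'on', visible: true },
  { text: 'q', visible: false }, // another hidden decoy
  { text: 'sored', visible: true },
];

console.log(looksSponsored(post));
```

Because only visible fragments are joined, reshuffling wrapper elements or adding more hidden decoys does not change the result; changing what counts as "visible" would, which may be one way such a script gets broken again.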

The MySQL architecture used by JPMorgan Chase's WePay

I saw "Highly Available MySQL Clusters at WePay", which covers the design of WePay's MySQL setup. I first assumed WePay was a WeChat service, but after looking into it, it turns out to belong to JPMorgan Chase...

It runs on GCP. The original MySQL setup used MHA + HAProxy (a patched version that allows the pool to be changed dynamically), with Routes handling HAProxy failover.

The problems they ran into were that a crash failover took at least 30 minutes to switch over, plus the network partition issues that come with spanning zones on GCP...

The follow-up architecture is even more complex, which makes you wonder whether it really solved the problem XDDD

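They switched to an architecture based on GitHub's Orchestrator, with two layers of HAProxy routing traffic (one on the client side, the other being the load balancer from the original architecture), plus Consul to update HAProxy's information?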

Thinking about why it was designed this way (given the financial-industry background) is actually pretty interesting...

The many fine points of DynamoDB Autoscaling...

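AdRoll's record of the pitfalls they hit with DynamoDB Autoscaling. Some of the details are things you probably wouldn't notice unless you jumped in and used it yourself (the devil is in the details): "Managing DynamoDB Autoscaling with Lambda and Cloudwatch".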

The first issue mentioned is what autoscaling observes:

Ideally, the table should scale based on the number of requests that we are making, not the number of requests that are successful.
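A minimal sketch of that idea, sizing capacity from attempted rather than successful requests. The function name, its parameters, and the 70% default utilization are assumptions for illustration, not AdRoll's algorithm; the inputs are assumed to be already-aggregated per-second figures.

```javascript
// Hypothetical helper: derive a capacity target from the requests that were
// attempted (consumed + throttled), not only from the ones that succeeded.
function targetCapacity(consumedUnits, throttledUnits, targetUtilizationPercent = 70) {
  const attempted = consumedUnits + throttledUnits;
  // Integer percent keeps the arithmetic exact for whole-number inputs.
  const needed = Math.ceil((attempted * 100) / targetUtilizationPercent);
  // Never go below one capacity unit.
  return Math.max(1, needed);
}

// With throttling ignored the table looks adequately sized; counting the
// throttled requests shows the real demand.
console.log(targetCapacity(100, 0, 50));
console.log(targetCapacity(100, 50, 50));
```

The second call asks for more capacity than the first purely because of the throttled requests, which is the gap the quoted sentence points at.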

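Another is that autoscaling won't scale down when a table is not being used at all, which looks like some kind of protection mechanism. But this means that for tables that are normally only read, you have to handle scaling the write capacity down yourself after a batch job finishes: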

Additionally, at the time of implementing this algorithm, the DynamoDB capacity could not be brought down automatically if the consumption was exactly zero, which can happen if you write to your table in batch instead of realtime, for example.

This meant that, when enabling autoscaling, tables that were read in realtime, but written to in batch, still needed manual intervention to bring the write capacity down after our jobs were done writing.

Yet another issue is that the number of scale-downs is limited:

Another interesting point that might bite users is that capacity decreases are an expensive operation for AWS, so they’re limited.

The number of decreases cited in the documentation can be achieved under very special conditions, since you need to have 4 decreases in the first hour of the day plus one for each of the remaining hours, for a total of 4 (first hour) + 23 (1 hourly) = 27.
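The arithmetic in that quote can be written out as a tiny function. The name and parameters are hypothetical, and the rule itself (four decreases in the first hour, then one per remaining hour) is taken from the quoted text as it stood at the time, not from current AWS documentation.

```javascript
// Best case described in the quote: use every allowed decrease in the first
// hour, then one decrease in each of the remaining hours of the day.
function maxDailyDecreases(firstHourDecreases = 4, hoursInDay = 24) {
  return firstHourDecreases + (hoursInDay - 1);
}

console.log(maxDailyDecreases()); // 27
```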

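The rest is about working out for themselves an algorithm that can be tuned more finely, then reimplementing it with Lambda... In the end the cost of their batch tables dropped to around 30% of the original: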

Here is where we detected our costs for our batch tables dropping to around 30% of the initial cost.

AdRoll presumably operates at a fairly large scale, so a saving like that is worth putting quite a bit of effort into...
