比 Bloom filter 與 Cuckoo filter 再更進一步的 Xor filter

Bloom filter 算是教科書上的經典演算法之一,在實際應用上有更好的選擇,像是先前提到的 Cuckoo filter:「Cuckoo Filter:比 Bloom Filter 多了 Delete」。

現在又有人提出新的資料結構,號稱又比 Bloom filter 與 Cuckoo filter 好:「Xor Filters: Faster and Smaller Than Bloom Filters」。

不過並不是完全超越,其中馬上可以看到的差異就是不支援 delete:

Deletions are generally unsafe with these filters even in principle because they track hash values and cannot deal with collisions without access to the object data: if you have two objects mapping to the same hash value, and you have a filter on hash values, it is going to be difficult to delete one without the other.

論文的預印本可以在 arXiv 上下載:「Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters」。

省頻寬的方法:終極版本...

看到「Three ways to reduce the costs of your HTTP(S) API on AWS」這邊介紹在 AWS 上省頻寬費用的方法,看了只能一直笑 XD

第一個是降低 HTTP response 裡沒有用到的 header,因為每天有五十億個 HTTP request,所以只要省 1byte 就是省下 USD$0.25/day:

Since we would send this five billion times per day, every byte we could shave off would save five gigabytes of outgoing data, for a saving of 25 cents per day per byte removed.

然後調了一些參數後省下 USD$1,500/month:

Sending 109 bytes instead of 333 means saving $56 per day, or a bit over $1,500 per month.

第二個是想辦法在 TLS 這邊下手,一開始其中一個方向是利用 TLS session resumption 降低第二次連線的成本,但他們發現沒有什麼參數可以調整:

One thing that reduces handshake transfer size is TLS session resumption. Basically, when a client connects to the service for the second time, it can ask the server to resume the previous TLS session instead of starting a new one, meaning that it doesn’t have to send the certificate again. By looking at access logs, we found that 11% of requests were using a reused TLS session. However, we have a very diverse set of clients that we don’t have much control over, and we also couldn’t find any settings for the AWS Application Load Balancer for session cache size or similar, so there isn’t really anything we can do to affect this.

所以改成把 idle 時間拉長 (避免重新連線):

That leaves reducing the number of handshakes required by reducing the number of connections that the clients need to establish. The default setting for AWS load balancers is to close idle connections after 60 seconds, but it seems to be beneficial to raise this to 10 minutes. This reduced data transfer costs by an additional 8%.

再來是 AWS 本身發的 SSL certification 太肥,所以他們換成 DigiCert 發的,大幅降低憑證本身的大小,反而省下 USD$200/day:

So given that the clients establish approximately two billion connections per day, we’d expect to save four terabytes of outgoing data every day. The actual savings were closer to three terabytes, but this still reduced data transfer costs for a typical day by almost $200.

這些方法真的是頗有趣的 XDDD

不過這些方法也是在想辦法壓榨降低與 client 之間的傳輸量啦,比起成本來說反而是提昇網路反應速度...

AWS Outposts 總算要開始出貨了

去年 AWSre:Invent 喊的 AWS Outposts 總算是有東西要出貨了:「AWS Outposts Now Available – Order Yours Today!」。

放在自家實體的機櫃,然後掛到 AWS 上變成一個特殊的 region。目前一個特殊的 region 只能放 16 個機櫃,但預期之後可以更多:

Capacity Expansion – Today, you can group up to 16 racks into a single capacity pool. Over time we expect to allow you to group thousands of racks together in this manner.

不過要注意的是,需要有 AWS Enterprise Support 才能下單,而且看起來硬體的維修也包在內了:

Support – You must subscribe to AWS Enterprise Support in order to purchase an Outpost. We will remotely monitor your Outpost, and keep it happy & healthy over time. We’ll look for failing components and arrange to replace them without disturbing your operations.

看了一下價錢的頁面,如果以北美的 upfront 來算,最便宜的是 OR-L8IF4WFOR-I0OGL02 的 USD$225,504.81,最貴的是 OR-HSZHMMF 的 USD$898,129.52,暫時應該用不到 XDDD

Amazon Redshift 會自動在背景重新排序資料以增加效能

Amazon Redshift 的新功能,會自動在背景重新排序資料以增加效能:「Amazon Redshift introduces Automatic Table Sort, an automated alternative to Vacuum Sort」。

版本要到更新到 1.0.11118,然後預設就會打開:

This feature is available in Redshift 1.0.11118 and later.

Automatic table sort is now enabled by default on Redshift tables where a sort key is specified.

重新排序的運算會在背景處理,另外帶一些行為學習分析:

Redshift runs the sorting in the background and re-organizes the data in tables to maintain sort order and provide optimal performance. This operation does not interrupt query processing and reduces the compute resources required by operating only on frequently accessed blocks of data. It prioritizes which blocks of table to sort by analyzing query patterns using machine learning.

算是丟著讓他跑就好的東西,升級上去後可以看一下 CloudWatch 的報告,這邊沒有特別講應該是還好... XD

Amazon Redshift 可以處理座標資料了

這一個月 AWS 因為舉辦一年一度的 AWS re:Invent,會開始陸陸續續放出各種消息... 這次是 Amazon Redshift 宣佈支援 spatial data,這樣一來就能夠方便的處理座標資料了:「Using Spatial Data with Amazon Redshift」。

支援的種類與使用的限制可以在官方的文件裡面看到,也就是「Querying Spatial Data in Amazon Redshift」與「Limitations When Using Spatial Data with Amazon Redshift」這兩篇。

其實限制還蠻多的,而其中這條看起來有點痛,擴充性受到限制:

Data types for Python user-defined functions (UDFs) don't support the GEOMETRY data type.

看起來暫時是先支援計算的部份,之後看看有沒有機會變成原生型態的支援,這樣使用起來才會比較順...

PostgreSQL 上去識別化的套件

在「PostgreSQL Anonymizer 0.5: Generalization and k-anonymity」這邊看到的套件,看起來可以做到一些常見而且簡單的去識別化功能:

The extension supports 3 different anonymization strategies: Dynamic Masking, In-Place Anonymization and Anonymous Dumps. It also offers a large choice of Masking Functions: Substitution, Randomization, Faking, Partial Scrambling, Shuffling, Noise Addition and Generalization.

看起來可以把欄位轉成 range 這件事情半自動化處理掉 (還是需要 SQL 本身呼叫這些函數),之後遇到 PII 的時候也許會用到...

EC2 推出 18TB 與 24TB 的機器...

AWS 又把機器給生出來啦:「EC2 High Memory Update – New 18 TB and 24 TB Instances」。

一樣是限制要買三年 RI 才能用,不過價錢頁面上好像還在更新,在「Amazon EC2 Dedicated Hosts Pricing」只看到了之前就公佈的 12TB 價錢,還沒看到 18TB 與 24TB 的部份...

然後以前會跟同事說,資料小於這台機器記憶體大小的不能叫 big data (當時是 12TB),現在升級到 24TB 啦...

Backblaze 開了歐洲區機房

Backblaze 開了歐洲機房,所以包括了一般性的 Computer BackupB2 Cloud Storage 都可以選擇要放哪邊了...

歐洲的點是放在荷蘭:

Big news: Our first European data center, in Amsterdam, is open and accepting customer data!

價錢也都跟美國的相同:

Whether you choose EU Central or US West, your pricing for our products will be unchanged:

對於在意資料放美國機房的問題應該有緩解一些...

RDBMS 裡的各種 Lock 與 Isolation Level

來推薦其他人寫的文章 (雖然是在 Medium 上...):「複習資料庫的 Isolation Level 與圖解五個常見的 Race Conditions」、「對於 MySQL Repeatable Read Isolation 常見的三個誤解」,另外再推薦英文維基百科上的「Snapshot isolation」條目。

兩篇文章都是中文 (另外一個是英文維基百科條目),就不重複講了,這邊主要是拉條目的內容記錄起來,然後寫一些感想...

SQL-92 定義 Isolation 的時候,技術還沒有這麼成熟,所以當時在訂的時候其實是以當時的技術背景設計 Isolation,所以當技術發展起來後,發生了一些 SQL-92 的定義沒那麼好用的情況:

Unfortunately, the ANSI SQL-92 standard was written with a lock-based database in mind, and hence is rather vague when applied to MVCC systems. Berenson et al. wrote a paper in 1995 critiquing the SQL standard, and cited snapshot isolation as an example of an isolation level that did not exhibit the standard anomalies described in the ANSI SQL-92 standard, yet still had anomalous behaviour when compared with serializable transactions.

其中一個就是 Snapshot Isolation,近代的資料庫系統都用這個概念實做,但實際上又有不少差別...

另外「Jepsen: MariaDB Galera Cluster」這篇裡出現的這張也很有用,裡面描述了不同層級之間會發生的問題:

這算是當系統有一點規模時 (i.e. 不太可能使用 SERIALIZABLE 避免這類問題),開發者需要了解的資料庫限制...

PostgreSQL 的 Bloom index

前幾天才跟人提到 PostgreSQL 的功能與完整性比 MySQL 多不少,剛剛又看到 Percona 的「Bloom Indexes in PostgreSQL」這篇,裡面提到了 PostgreSQL 可以使用 Bloom filter 當作 index。

查了一下資料是從 PostgreSQL 9.6 支援的 (參考「PostgreSQL: Documentation: 9.6: bloom」這邊的說明),不過說明裡面沒看到 DELETE (以及 UPDATE) 會怎麼處理,因為原版的 Bloom filter 資料結構應該沒有能力處理刪除的情況...

另外這幾年比較有名的應該是 Cuckoo filter,不只支援刪除,而且空間與效能都比 Bloom filter 好,不知道為什麼是實做 Bloom filter...