Home » Posts tagged "json"

新出的 RFC 8259:The JavaScript Object Notation (JSON) Data Interchange Format

JSON 的規格書又被更新了 XD

在「The Last JSON Spec」這邊,Tim Bray 寫了這篇關於新的 RFC 8259 跟之前的差異,以及大家對於雙重標準的顧慮。

最大的差異在於,在 RFC 8259 規定了「如果 JSON 被用在非封閉的系統交換資料,必須使用 UTF-8」:

8259 con­tains one new sen­tence: “JSON text ex­changed be­tween sys­tems that are not part of a closed ecosys­tem MUST be en­cod­ed us­ing UTF-8 [RFC3629].” Giv­en that, by 2017, an at­tempt to ex­change JSON en­cod­ed in any­thing but UTF-8 would be ir­ra­tional, this hard­ly needs say­ing; but its ab­sence felt like an omis­sion.

而關於 ECMA-404 與 RFC 8259 都定義了 JSON 的問題他也說明了,因為很多人花了很多力氣在確保這兩份文件的正確性上,所以應該不會有問題 (i.e. 衝突):

The rea­son 8259 ex­ists is that the ECMAScript gang went and wrote their own ex­treme­ly min­i­mal spec, Stan­dard ECMA-404: The JSON Da­ta In­ter­change Syn­tax, and there was rea­son for con­cern over du­el­ing stan­dard­s. But, af­ter a cer­tain amount of standards-org elephant-gavotte, each of ECMA 404 and RFC 8259 nor­ma­tive­ly ref­er­ences the oth­er and con­tains a com­mit­ment to keep them con­sis­tent in case any er­rors turn up. Which is a good thing, but this text has been re-examined and re-polished so many times that I doubt ei­ther side will ev­er re­vis­it the ter­ri­to­ry, thank good­ness.

另外他也提到了對於不同情境下可以看不同的文件。像是要了解 JSON 的話,可以看當初發明 JSON 的 Doug Crockford 所設立的網站 (在「JSON」這邊);而在交換時應該參考 I-JSON (Internet JSON,RFC 7493):

Which spec should you use? · If you want to un­der­stand JSON syn­tax, you still can’t beat Doug Crockford’s orig­i­nal for­mu­la­tion at JSON.org. If you want to use an RFC as foun­da­tion for a REST API or some oth­er In­ter­net pro­to­col, I ac­tu­al­ly don’t rec­om­mend 8259, I rec­om­mend I-JSON, RFC 7493, which de­scribes ex­act­ly the same syn­tax as all the oth­er specs (by ref­er­enc­ing 7159), but ex­plic­it­ly rules out some legal-but-dumbass things you could do that might break your pro­to­col, for ex­am­ple us­ing any­thing but UTF-8 or hav­ing du­pli­cate mem­ber names in your ob­ject­s.

I-JSON 是 JSON 的子集合,比較重要的:

  • (MUST) 使用 UTF-8。
  • (SHOULD NOT) 浮點數的部份,不得超過 IEEE 754-2008 binary64 (double precision) 的範圍。
  • (SHOULD NOT) 整數的部份,不得超過 [-(2**53)+1, (2**53)-1]) 的範圍。
  • (RECOMMEND) 有超過的需求使用字串表示。
  • (MUST NOT) JSON object 內不得有重複的 name。
  • (SHOULD NOT) 最上層的型態不得使用字串,只能使用 object 或是 array。
  • (MUST NOT) 遇到先前沒有定義過的元素不得視為錯誤。(像是新版 API 內會在 object 裡增加元素)
  • (RECOMMEND) 時間使用 ISO 8601 表示 (在 RFC 3339 有提到),英文字的部份全部使用大寫,一定要標上時區,而秒數的 0 一定要加上去 (也就是 00 秒)。
  • (RECOMMEND) 時間長度也建議依照 RFC 3339 處理。
  • (RECOMMEND) Binary 資料用 base64url 傳 (RFC 4648)。

JSON 的 Object 裡 Key 重複的問題

tl;dr:不要亂來啦... 這是 UB (Undefined behavior) 的一種。

因為看到這則 tweet,所以去查一下 JSON 的資料:

首先是找標準是什麼。在維基百科的 JSON 條目裡提到了有兩份標準,一份是 RFC 7159,一份是 ECMA-404

Douglas Crockford originally specified the JSON format in the early 2000s; two competing standards, RFC 7159 and ECMA-404, defined it in 2013. The ECMA standard describes only the allowed syntax, whereas the RFC covers some security and interoperability considerations.

ECMA-404 裡面就真的只講語法沒講其他東西,而在 RFC 7159 內的 Object 則是有提到 (重點我就用粗體標起來了):

An object structure is represented as a pair of curly brackets surrounding zero or more name/value pairs (or members). A name is a string. A single colon comes after each name, separating the name from the value. A single comma separates a value from a following name. The names within an object SHOULD be unique.

   object = begin-object [ member *( value-separator member ) ]
            end-object

   member = string name-separator value

An object whose names are all unique is interoperable in the sense that all software implementations receiving that object will agree on the name-value mappings. When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.

JSON parsing libraries have been observed to differ as to whether or not they make the ordering of object members visible to calling software. Implementations whose behavior does not depend on member ordering will be interoperable in the sense that they will not be affected by these differences.

粗體有描述唯一性,但尷尬的地方在於他用 SHOULD 而非 MUST,所以 library 理論上都要能接受。但後面提到如果不唯一時,行為無法預測 (會到 rm -rf / 嗎?XDDD 最像的應該還是 crash?),所以還是不要亂來啦...

不過如果真的會 crash 的話,應該也會因為 DoS issue 而被發 CVE,所以實務上應該是不會 crash 啦...

在 CLI 下開關以及查詢 EC2 的狀態...

有時候需要開 Ubuntu 測試東西,會在 AWS 上開 EC2 起來測試,但開 console 太麻煩了,寫幾個 function 丟進 shell script 裡面比較乾脆。其中查詢 Ubuntu AMI 的程式出自「How do I know what Ubuntu AMI to launch on EC2?」這邊。

ec2.ls() 裡,我的 jq 版本比較舊,不過不影響我的 copy & paste,所以就沒有 hack 他了。新版的應該可以多加上 | @tsv 變成 tab 隔開 (沒測過,查資料時查到而已)。

ec2.run() 裡,我這邊是先到 console 上查出 security group 與 subnet 的 id,然後這邊 hard code 進去。我的預設是開 t2.medium,臨時要指定的話就 ec2.run t2.nano 就可以改開 t2.nano 了,不過要注意的是,這邊程式在查詢時的條件是 hvm:ebs,換的時候要注意 image 相容性...

# AWS-related
function ec2.ls() {
    aws ec2 describe-instances | \
        jq -c -M '.Reservations[] | .Instances[] | [.InstanceId, .InstanceType, .PublicIpAddress]'
}

function ec2.rm() {
    local INSTANCE_ID=${1:i-xxxxxxxxxxxxxxxxx}
    aws ec2 terminate-instances --instance-id ${INSTANCE_ID}
}

function ec2.run() {
    local INSTANCE_TYPE=${1:-t2.medium}
    aws ec2 run-instances --image-id $(ec2.ubuntu_ami()) --key-name gslin --security-group-ids sg-xxxxxxxx --instance-type ${INSTANCE_TYPE} --subnet-id subnet-xxxxxxxx
}

function ec2.ubuntu_ami() {
    curl -s "https://cloud-images.ubuntu.com/locator/ec2/releasesTable" | \
    sed '$x;$G;/\(.*\),/!H;//!{$!d};$!x;$s//\1/;s/^\n//' | \
    jq -c '.aaData[] | select(contains(["16.04", "us-east-1", "hvm:ebs"]))' | \
    grep -o 'ami-[a-z0-9]\+' | \
    head -1
}

這種工具自己用的順手比較重要,要什麼功能自己改自己加...

話說 Ubuntu 網站上的 JSON 居然吐出 malformed data (trailing comma),這是自己 printf() 之類硬幹出來的嗎... XD

HTTP/2 時代的 API 設計

在「Let’s Stop Building APIs Around a Network Hack」這邊提到了以前為了解決 HTTP/1.1 的問題而發展出來的 workaround,在 2015 年發展出來的 HTTP/2 從底層直接解了不少問題,加上很快被許多瀏覽器支援 (就算不支援 HTTP/2 也只是降到 HTTP/1.1 跑,比較慢而已):

Guess what else was released in May 2015? RFC 7540, otherwise known as HTTP/2. In retrospect this seems highly poetic, as HTTP/2 kinda makes the compound document aspect of JSON-API a little bit pointless, and compound documents to me go hand in hand with what JSON-API is as a standard.

2012 年在 MOPCON 第一屆講的「API Design Optimized for Mobile Platform」剛好就是這個主題:

有種懷念感... XD

PHP 7.3 的 json_decode() 將會用 Exception 處理錯誤

在「PHP: rfc:json_throw_on_error」這邊提到 PHP 7.3 會解決 json_decode() 發生錯誤時的處理方式:

PHP has two functions for dealing with JSON, json_decode() and json_encode(). Unfortunately, both have suboptimal error handling. json_decode() returns null upon erroring, but null is also a possible valid result (if decoding the JSON “null”).

在這之前唯一的判斷方式是另外再呼叫 json_last_error() 或是 json_last_error_msg(),但這樣寫很辛苦,所以要引入 JsonException 了,總算...

Amazon Athena:直接在 S3 上進行分析

Amazon Athena 提供另外一種選擇,讓分析的便利性增加了許多:「Amazon Athena – Interactive SQL Queries for Data in Amazon S3」。

以往都需要開 server 起來分析,這個新的服務直接使用就好:

Athena is based on the Presto distributed SQL engine and can query data in many different formats including JSON, CSV, log files, text with custom delimiters, Apache Parquet, and Apache ORC.

果然是用 Presto 改出來的... XDDD

指定好各種資料來源之後直接下 SQL query 分析,然後依照分析的量來算錢... 而 FAQ 的地方也有提到可以透過 JDBC 接上去,這樣看起來跑報表的場合直接丟給他處理了:

Amazon Athena can be accessed via the AWS management console and a JDBC driver. You can programmatically run queries, add tables or partitions using the JDBC driver.

隔壁 Amazon Redshift 的立場變得很尷尬啊,Amazon Athena 不需要養機器而且又可以直接從 Amazon S3 拉資料,如果之後把 Presto 對 RDBMS 的部分再補上來的話就更棒了... (應該是下一階段的任務,把 RDS 補上)

Pinterest 在 InnoDB Compression 的努力

Pinterest 用 InnoDB 儲存各式資料,而且使用了 InnoDB Compression 的功能。他們花了不少力氣跟 Percona 合作改善 InnoDB Compression 的效能:「Evolving MySQL Compression - Part 1」。

文章有點長度,重點在於他們在 MySQL 裡面放了大量的 JSON:

A Pin is stored as a 1.2 KB JSON blob in sharded MySQL databases.

他們發現新版 zlib 的 predefined dictionary 可以讓壓縮率變得更高 (從本來的 ~50% 到 ~66%);而除了壓縮率變高外,由於事先定義了字典內容,對於效能的提昇也不少 (warm up):

Zlib version 1.2.7.1 was released in early 2013 and added the ability to use a predefined “dictionary” to prefill the lookback window for LZ77. This seemed promising since we could “warm up” the lookback window with field names and other common strings. We ran a few tests using the Python Zlib library with a naive predefined dictionary consisting of an arbitrary Pin JSON blob. The compression savings increased from ~50% to ~66% at what appeared to be relatively little cost.

另外他們做了 read-only 的 benchmark (畢竟這是重點)。圖片資料有點糊,但可以看出 y 軸是 Queries/sec。而 x 軸上則用文字給了些說明,黃色是 TokuDB,紅色是本來的 InnoDB Compression,剩下的都是不同的字典集的成果:

Below is a graph from our presentation which showed a read-only version of our production workload at concurrency of 256, 128, 32, 16, 8, 4 and 1 clients. TokuDB is in yellow, InnoDB page compression is in red and the other lines are column compression with a variety of dictionaries.

整體效率都比之前高不少,尤其是當 concurrent query 的數量偏高的時候差距會很大。

而這個功能將會納入未來的 Percona 版本,對於在 MySQL 裡面會塞 JSON 或是 XML 的人應該會很有幫助:

We worked with Percona to create a specification for column compression with an optional predefined dictionary and then contracted with Percona to build the feature.

設計資料同步問題時一定會遇到的 Conflict 解決方案

在「A Conflict-Free Replicated JSON Datatype」這邊看到有趣的東西。(arXiv 說 2016/08/18 會有一個小時的 downtime,台灣時間剛好是 2016/08/18 的 20:20 開始:「Maintenance scheduled for Aug 18 8:20 a.m. EDT」)

作者們設計這個架構是想要在 JSON 結構上找出一個演算法,在 P2P 架構上 (而不需要靠 server) 可以同步並且產生一致的結果,另外要求當 conflict 時不要掉資料:

In this paper we present an algorithm and formal semantics for a JSON data structure that automatically resolves concurrent modifications such that no updates are lost, and such that all replicas converge towards the same state.

作者提出來的想法不是很複雜,而且 merge 保留姿的方法也頗... 特別,但總是給大家一個想法,各何況很多情況下都是有 server 架構,就簡單多了...

MySQL 5.7 的 JSON、Virtual Column 以及 Index

Percona 提到了 MySQL 5.7 的 JSON 與 virtual column,再加上 index 後的效能提昇:「JSON document fast lookup with MySQL 5.7」。

每一家都把這些功能給做出來了,在 MySQL 5.7 提供了 JSON 格式:

CREATE TABLE `test_features` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `feature` json NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

而你可以對 JSON 欄位運算,拉出資料後產生出 virtual column:

ALTER TABLE test_features ADD COLUMN street VARCHAR(30) GENERATED ALWAYS AS (json_unquote(json_extract(`feature`,'$.properties.STREET'))) VIRTUAL;

然後就可以對 virtual column 下 index:

ALTER TABLE test_features ADD KEY `street` (`street`);

接著對 virtual column 查詢的速度就會超快:

SELECT count(*) FROM test_features WHERE street = 'BEACH';

其實不一定要 JSON,光是 virtual column 與 index 就可以解決老問題:

已經有使用者所在的國家,想要快速查詢使用者是住在哪個洲 (亞洲、歐洲、美洲、...)。

以前是另外拆出一個欄位來做,用 trigger 更新確保資料正確性後,再對拆出來的欄位下 index。現在可以用 virtual column 建立出來下 index。

Composer 出現 lock file out of date 的解法

之前也有遇到,結果 Lorna Jane Mitchell 寫了一篇「Handling Composer "lock file out of date" Warning」給了不少解法。

第一種是 composer upgrade,直接全部升級。

第二種是先用 composer update --dry-run 看看是不是有人加到 composer.json 後忘記更新,如果是的話再用 composer update [package] 處理。

第三種是用 composer update nothing 閉上眼睛更新 hash 值,當作什麼都沒看到 XDDD

Sometimes this is the right answer so it's a handy trick to know!. I seem to see these kinds of issues in people's projects quite often (I'm a consultant, I see a lot of projects) so I thought I'd share my usual tactics for getting things sorted - if you have any tricks of your own to share, I'd love to hear them :)

該說很「實用」嗎 XDDD

Archives