Tag Archives: data

GrabFood Uses Location Data to Fix Restaurant Information

Grab's 「How we harnessed the wisdom of crowds to improve restaurant location accuracy」 is a write-up by their data team on how they used data they already had to quickly correct restaurant information. The methods described don't need machine learning; a few simple statistical calculations are enough to quickly clean up the existing data.

The information is actually collected through the driver app: the app sends a large amount of data back to the servers (such as periodically reported GPS positions and the pick-up status), and since the drivers know the area, what's in their heads is often a lot more accurate than the map:

One of the biggest advantages we have is the huge driver-partner fleet we have on the ground in cities across Southeast Asia. They know the roads and cities like the back of their hand, and they are resourceful. As a result, they are often able to find the restaurants and complete orders even if the location was registered incorrectly.

So this information can be used in the other direction to improve the map data. For example, from where drivers tap the "pick up" button and how long they stay there, you can estimate where the restaurant probably is, then compare that against the map data, which makes it easy to spot restaurants that have moved without the map being updated:

Fraction of the orders where the pick-up location was not “at” the restaurant: This fraction indicates the number of orders with a pick-up location not near the registered restaurant location (with near being defined both spatially and temporally as above). A higher value indicates a higher likelihood of the restaurant not being in the registered location subject to order volume

Median distance between registered and estimated locations: This factor is used to rank restaurants by a notion of “importance”. A restaurant which is just outside the fixed radius from above can be addressed after another restaurant which is a kilometer away.
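
A small sketch of what these two signals could look like, assuming each order carries only the GPS point where the driver tapped "pick up"; the data, the 200m "near" radius and the component-wise median below are my own simplifications (the article's "near" also has a temporal part), not Grab's actual pipeline:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

struct point { double lat, lng; };

/* great-circle distance in meters between two lat/lng points (haversine) */
static double dist_m(struct point a, struct point b)
{
    const double R = 6371000.0;
    const double rad = 3.14159265358979323846 / 180.0;
    double dlat = (b.lat - a.lat) * rad;
    double dlng = (b.lng - a.lng) * rad;
    double h = sin(dlat / 2) * sin(dlat / 2) +
               cos(a.lat * rad) * cos(b.lat * rad) *
               sin(dlng / 2) * sin(dlng / 2);
    return 2 * R * asin(sqrt(h));
}

static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

static double median(double *v, int n)
{
    qsort(v, n, sizeof(double), cmp_double);
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2;
}

int main(void)
{
    /* registered location vs. the points where drivers tapped "pick up"
     * (made-up numbers, just to show the shape of the calculation) */
    struct point registered = { 1.3000, 103.8000 };
    struct point pickups[4] = {
        { 1.3052, 103.8321 }, { 1.3049, 103.8318 },
        { 1.3051, 103.8323 }, { 1.3050, 103.8319 },
    };
    double lats[4], lngs[4];
    const double radius_m = 200.0;   /* spatial half of the "near" check */
    struct point estimated;
    int n = 4, far = 0, i;

    for (i = 0; i < n; i++) {
        lats[i] = pickups[i].lat;
        lngs[i] = pickups[i].lng;
        if (dist_m(pickups[i], registered) > radius_m)
            far++;
    }

    /* estimated location: component-wise median of the pick-up points */
    estimated.lat = median(lats, n);
    estimated.lng = median(lngs, n);

    printf("fraction of orders not \"at\" the restaurant: %.2f\n",
           (double)far / n);
    printf("registered vs estimated distance: %.0f m\n",
           dist_m(registered, estimated));
    return 0;
}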

There are quite a few other improvements as well (such as requiring the driver to be within a certain distance of the restaurant before the "pick up" button can be tapped, with that distance adjusted for cases like restaurants inside indoor shopping malls), and the overall result shows up as a big drop in the order cancellation rate.

Overall it looks like the system produces a list that humans then follow up on (by calling the restaurant to ask, perhaps?), but the list generated this way should be quite accurate (drivers have no reason to waste their own time by tapping "pick up" in some odd place), so running simple algorithms over this data can fix a lot of problems quickly...

The B-tree Structure in PostgreSQL

Saw 「Indexes in PostgreSQL — 4 (Btree)」, which explains PostgreSQL's B-tree structure and how common queries make use of the B-tree.

It covers three kinds of queries: the first is search by equality, the second is search by inequality, and the third is search by range. Later on it also covers sorting and how indexes are used for it.
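
At the leaf level all three come down to the same thing: descend to the first qualifying entry, then scan forward in key order. A minimal sketch of that idea over a plain sorted array standing in for the ordered leaf entries (not PostgreSQL's actual code):

#include <stdio.h>

/* index of the first element >= key in a sorted array (lower bound) */
static int lower_bound(const int *a, int n, int key)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

int main(void)
{
    int keys[] = { 3, 7, 7, 12, 19, 25, 31 };   /* "leaf" entries in key order */
    int n = sizeof(keys) / sizeof(keys[0]), i;

    /* search by equality: col = 7 -> land on the first 7, scan while equal */
    for (i = lower_bound(keys, n, 7); i < n && keys[i] == 7; i++)
        printf("eq  -> %d\n", keys[i]);

    /* search by inequality: col >= 12 -> land on the first 12, scan to the end */
    for (i = lower_bound(keys, n, 12); i < n; i++)
        printf("gte -> %d\n", keys[i]);

    /* search by range: 7 <= col <= 19 -> land on the first 7, scan until > 19 */
    for (i = lower_bound(keys, n, 7); i < n && keys[i] <= 19; i++)
        printf("rng -> %d\n", keys[i]);

    return 0;
}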

Although the article analyzes PostgreSQL, the concepts are general, and other databases built on B-tree structures work in a similar way...

Finding the Mean of a List of Numbers

An article from 2016, but it's a classic problem, so it has surfaced again recently. How do you find the mean of a list of numbers: 「Calculating the mean of a list of numbers」.

You have a list of floating point numbers. No nasty tricks - these aren’t NaN or Infinity, just normal “simple” floating point numbers.

Now: Calculate the mean (average). Can you do it?

You have a list of floating point numbers (no NaN or Infinity); how do you find the mean? Things to consider include:

  • The first thing to handle is all the ways the algorithm can overflow.
  • Reducing the error.
  • Keeping the amount of computation reasonable.

It seems like a good question for a data team to discuss with candidates during interviews? The "mean" is a metric that already has business meaning, and the volume of data pouring in from time-series events can easily produce all kinds of overflow or precision problems, so this really is a situation you run into in the real world.

Thinking about it for a bit: with integers it really is much simpler (you can compute the exact value), but with floats it's genuinely a lot harder:

It also demonstrates a problem: Floating point mathematics is very hard, and this makes it somewhat unsuitable for testing with Hypothesis.

The landmine that immediately comes to mind is that in the world of IEEE 754 single-precision floats, 2^24 + 1 is still 2^24:

#include <math.h>
#include <stdio.h>

int main(void)
{
    int i;
    float a;

    /* print 2^i and 2^i + 1 as single-precision floats;
     * from i = 24 on, the float can no longer represent the "+ 1" */
    for (i = 0; i < 32; i++) {
        a = pow(2, i);
        printf("2^%d     = %f\n", i, a);

        a += 1;
        printf("2^%d + 1 = %f\n", i, a);
    }

    return 0;
}

And here you can see the difference:

2^23     = 8388608.000000
2^23 + 1 = 8388609.000000
2^24     = 16777216.000000
2^24 + 1 = 16777216.000000
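
One common way around the overflow part is to keep a running mean instead of summing everything up first; a minimal sketch in double precision (this is just the textbook incremental update, not the article's answer, and it doesn't address the accuracy point, which usually calls for something like compensated summation):

#include <stdio.h>

/* running mean: m += (x[i] - m) / (i + 1)
 * the full sum is never formed, so it cannot blow up the way a naive
 * "sum everything, then divide" does for large same-sign inputs */
static double running_mean(const double *x, int n)
{
    double m = 0.0;
    int i;
    for (i = 0; i < n; i++)
        m += (x[i] - m) / (i + 1);
    return m;
}

int main(void)
{
    /* four huge (but valid) doubles: the naive sum overflows to inf */
    double x[] = { 1e308, 1e308, 1e308, 1e308 };
    double sum = 0.0;
    int i;

    for (i = 0; i < 4; i++)
        sum += x[i];

    printf("naive sum / 4 = %g\n", sum / 4);            /* inf */
    printf("running mean  = %g\n", running_mean(x, 4)); /* 1e+308 */
    return 0;
}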

YAML's Pain Points

Saw 「In defense of YAML」 on Changelog, which talks about YAML's problems and is based on the article 「In Defense of YAML」.

I don't necessarily accept everything the article says, but two of the points it raises really are pain points: the first is whitespace (or rather, indentation), and the second is the special syntax (the classic example being YAML 1.1 treating a bare no as the boolean false). Both are headaches whenever you work with YAML:

Whitespace is a minefield. Its syntax is surprisingly complex.

It's like JavaScript's == (I mean my earlier post 「JavaScript 的 == 條列式比較」): you can memorize the rules, but they feel like they have no rhyme or reason, which leaves you with a certain sense of resignation...

The article also mentions that JSON's lack of comments is indeed one of the more frustrating parts of using it...

Facebook Employee Reveals Internal Passwords Were Stored in Plain Text

Seen on Krebs on Security: 「Facebook Stored Hundreds of Millions of User Passwords in Plain Text for Years」; Facebook's official response is at 「Keeping Passwords Secure」.

A few key points. The first is the scope: so far, data going back to 2012 has been found to be included:

The Facebook source said the investigation so far indicates between 200 million and 600 million Facebook users may have had their account passwords stored in plain text and searchable by more than 20,000 Facebook employees. The source said Facebook is still trying to determine how many passwords were exposed and for how long, but so far the inquiry has uncovered archives with plain text user passwords dating back to 2012.

Another key point is that this data has already been queried internally in large volumes (oh dear):

My Facebook insider said access logs showed some 2,000 engineers or developers made approximately nine million internal queries for data elements that contained plain text user passwords.

Also, Legal and PR have both kicked into action, and the public statements will dress up the numbers to reduce the damage:

“The longer we go into this analysis the more comfortable the legal people [at Facebook] are going with the lower bounds” of affected users, the source said. “Right now they’re working on an effort to reduce that number even more by only counting things we have currently in our data warehouse.”

They will also play down the follow-up steps:

Renfro said the company planned to alert affected Facebook users, but that no password resets would be required.

Another piece of news from last year is worth reading alongside this one: 「Facebook's security chief is leaving, and no one's going to replace him」:

Instead of building out a dedicated security team, Facebook has dissolved it and is instead embedding security engineers within its other divisions. “We are not naming a new CSO, since earlier this year we embedded our security engineers, analysts, investigators, and other specialists in our product and engineering teams to better address the emerging security threats we face,” a Facebook spokesman said in an email. Facebook will “continue to evaluate what kind of structure works best” to protect users’ security, he said.

Looks like it's time to change passwords again... (luckily I'm already used to a password manager, so every site has a different password?)

Oh, and one more concept worth adding: when they say "we have no evidence that anyone accessed...", the more accurate phrasing would be "we weren't auditing this part... so there is no evidence".

VisiData, for Viewing Data in the Terminal

A project I saw in the 「VisiData」 post; the project page is 「A Swiss Army Chainsaw for Data」, and from the screenshots you can tell it's a terminal-based viewer.

What caught my attention is that it supports .xls:

explore new datasets effortlessly, no matter the format: vd foo.json bar.csv baz.xls

The full list is under SUPPORTED SOURCES; it even supports pcap, which makes me curious what that looks like :o

PostgreSQL's Fix for fsync()

I previously wrote 「PostgreSQL 對 fsync() 的行為傷腦筋...」 about how fsync() behaves differently from what developers expect in some cases, but then forgot to follow up on it...

I just saw that the folks at Percona wrote 「PostgreSQL fsync Failure Fixed – Minor Versions Released Feb 14, 2019」 and realized the corresponding updates were already released on 2/14; you can also see it in the release notes:

By default, panic instead of retrying after fsync() failure, to avoid possible data corruption (Craig Ringer, Thomas Munro)

Some popular operating systems discard kernel data buffers when unable to write them out, reporting this as fsync() failure. If we reissue the fsync() request it will succeed, but in fact the data has been lost, so continuing risks database corruption. By raising a panic condition instead, we can replay from WAL, which may contain the only remaining copy of the data in such a situation. While this is surely ugly and inefficient, there are few alternatives, and fortunately the case happens very rarely.

A new server parameter data_sync_retry has been added to control this; if you are certain that your kernel does not discard dirty data buffers in such scenarios, you can set data_sync_retry to on to restore the old behavior.

The current workaround is that when fsync() fails, PostgreSQL simply panics to avoid data corruption and replays from the WAL, which also means the HA mechanism (if one is set up) may get triggered for this reason...

There is also the new data_sync_retry parameter, which lets PostgreSQL administrators forcibly turn off this panic behavior and have PostgreSQL retry the fsync() instead; presumably this will become useful once the kernel behavior is changed later on...
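
As a rough sketch of the policy the release note describes (the names and structure here are made up for illustration, not PostgreSQL's actual code):

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* stands in for the data_sync_retry setting: default off */
static bool data_sync_retry = false;

static void flush_segment(int fd)
{
    if (fsync(fd) != 0) {
        if (!data_sync_retry) {
            /* the kernel may already have thrown away the dirty buffers, so
             * a later retry could "succeed" with the data gone; panic and
             * let crash recovery replay from the WAL, which still has a copy */
            fprintf(stderr, "PANIC: could not fsync, replaying from WAL\n");
            _exit(1);
        }
        /* data_sync_retry = on: old behavior, leave it and try the
         * fsync() again at the next checkpoint */
    }
}

int main(void)
{
    int fd = open("fsync_demo.tmp", O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return 1;
    if (write(fd, "x", 1) < 0)
        perror("write");
    flush_segment(fd);
    close(fd);
    return 0;
}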

Tools for Migrating from Microsoft SQL Server to PostgreSQL

In 「How to Migrate from Microsoft SQL Server to PostgreSQL」, the author's client needed to migrate from Microsoft SQL Server to PostgreSQL (no reason is given).

The migration is mainly done in two stages. The first stage is converting the schema, for which the author uses dalibo/sqlserver2pgsql, a tool written in Perl:

Migration tool to convert a Microsoft SQL Server Database into a PostgreSQL database, as automatically as possible http://dalibo.github.io/sqlserver2pgsql

The second stage is converting the data, for which they chose the Community Edition of Pentaho Data Integration:

Pentaho offers various stable data-​centric products. Pentaho Data Integration (PDI) is an ETL tool which provides great support for migrating data between different databases without manual intervention. The community edition of PDI is good enough to perform our task here. It needs to establish a connection to both the source and destination databases. Then it will do the rest of work on migrating data from SQL server to Postgres database by executing a PDI job.

So the two tools are chained together... The article doesn't mention issues like stored procedures, so presumably their client doesn't use them, or only rarely?

The Various Gotchas of DynamoDB Autoscaling...

AdRoll's write-up of the pitfalls they hit with DynamoDB Autoscaling; some of the details are things you probably wouldn't notice unless you had actually jumped in and played with it (the devil is in the details): 「Managing DynamoDB Autoscaling with Lambda and Cloudwatch」.

The first problem mentioned is what autoscaling actually watches:

Ideally, the table should scale based on the number of requests that we are making , not the number of requests that are successful.
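
In other words, throttled requests have to count toward the demand, not just the consumed (successful) capacity. A toy version of that calculation, where the names and the 70% target utilization are my own, not AdRoll's actual algorithm:

#include <math.h>
#include <stdio.h>

/* scale on requests made (consumed + throttled), not just the ones that
 * succeeded, so a heavily throttled table still scales up */
static double desired_capacity(double consumed_per_sec,
                               double throttled_per_sec,
                               double target_utilization)
{
    return ceil((consumed_per_sec + throttled_per_sec) / target_utilization);
}

int main(void)
{
    /* 80 successful units/s plus 40 throttled requests/s at a 70% target:
     * scaling only on the successful 80 would keep the table too small */
    printf("provision %.0f units\n", desired_capacity(80, 40, 0.7));
    return 0;
}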

Another is that autoscaling will not scale down when the table is completely idle, which looks like some kind of protection mechanism. But this means that tables which are normally only read from still need manual handling to bring the write capacity back down after a batch job finishes:

Additionally, at the time of implementing this algorithm, the DynamoDB capacity could not be brought down automatically if the consumption was exactly zero, which can happen if you write to your table in batch instead of realtime, for example.

This meant that, when enabling autoscaling, tables that were read in realtime, but written to in batch, still needed manual intervention to bring the write capacity down after our jobs were done writing.

Another problem is that the number of scale-downs is limited:

Another interesting point that might bite users is that capacity decreases are an expensive operation for AWS, so they’re limited.

The number of decreases cited in the documentation can be achieved under very special conditions, since you need to have 4 decreases in the first hour of the day plus one for each of the remaining hours, for a total of 4 (first hour) + 23 (1 hourly) = 27.

The rest of the post is about working out an algorithm that can adjust capacity more finely and then rewriting it with Lambda... in the end their costs dropped to about 30% of what they started with (roughly a 70% saving):

Here is where we detected our costs for our batch tables dropping to around 30% of the initial cost.

AdRoll presumably operates at a fairly large scale, so it's worth putting quite a bit of effort into a saving like that...

Switched from b-mobile's おかわりSIM to the 190PadSIM

Because I need to keep a Japanese number alive (to receive SMS), and I also want a local data plan when I visit Japan that works before I reach the guesthouse (many guesthouses provide a mobile hotspot you can take out with you), I originally signed up for b-mobile's 「b-mobile おかわりSIM 5段階定額」 plan: a one-time setup fee of JPY¥3000, then a base fee of JPY¥630 per month, which includes 1GB of data per month; you can pay to use up to 5GB (an extra JPY¥250/GB), after which the speed is throttled to 200kbps.

The speed wasn't great at first, so I treated it purely as keeping the number alive: 「b-mobile 的おかわり的速度」. But on a few later trips to Japan it turned out to be quite a bit better (720p on Twitch was still watchable), so it became one of my network options for Japan (with a backup ready to activate when needed). The plan has since stopped accepting new applications, but existing subscribers can keep using it.

The plan that came out later is 「b-mobile S 190PadSIM」; as the name suggests, it's designed for tablets. The setup fee is the same JPY¥3000, but the monthly base fee drops to JPY¥190. The plan only includes 100MB of data, but since it's aimed at tablet users, it's designed so you can pay to use up to 15GB (in steps: JPY¥480 for 1GB, JPY¥850 for 3GB, JPY¥1450 for 6GB, JPY¥2190 for 10GB, and JPY¥3280 for 15GB).

Per GB, this plan is cheaper than the おかわりSIM (somewhere around JPY¥200/GB), but for usage in the 1GB~2GB and 3GB~5GB ranges it actually comes out more expensive, so switching isn't necessarily a win. Still, since the base fee is so much lower, it saves quite a bit for anyone just keeping a number alive...
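
For a rough worked comparison (assuming the stepped charges are billed on top of the ¥190 base fee, which is how I read the plan): at 2GB the 190PadSIM needs the 3GB step, so ¥190 + ¥850 = ¥1040, versus ¥630 + ¥250 = ¥880 on the おかわりSIM; with no extra data it's ¥190 versus ¥630, which is where the saving for keeping a number alive comes from.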

I had assumed I would need to apply for a new card, so I never had the motivation to deal with it (the setup fee plus the cost of getting the card back to Taiwan), until I logged into the b-mobile admin panel and found I could simply change the plan directly, so I switched. Note that the new plan only takes effect from the next billing cycle.