Home » Posts tagged "data"

Twitter 的 280 字帶來的差異

在「140 Vs. 280: Users Engage With Longer Tweets Data Shows」這邊分析了在 Twitter 上 0~140 與 141~280 字的 tweet 所帶來的互動差異:

可以看到較長的 tweet 會有比較多的 retweet 與 like,不過更細一步的分析就沒有了... 文章內也有提到資料的分析是怎麼來的:

The data parameters: 30,000 publisher tweets that included links between November 29 – December 6.
The results: The click-through rate was roughly equal for both tweet length types but overall engagement nearly doubled for longer tweets. On tweets containing 141-280 characters, the average retweet was a staggering 26.52% – compared the 13.71% for tweets with 0-140 characters. For likes, tweets containing 141-280 characters had an average of a whopping 50.28%, compared to 0-140’s 26.96%.

Amazon CloudWatch 支援縮放與拖拉調整時間區間

Amazon CloudWatch 的操作上支援 Zoom 與 Pan 了:「Amazon CloudWatch now supports two new chart visualization options in metrics and dashboards」。

Zoom 是改變時間的粒度:

You can use the CloudWatch console to graph metric data generated by AWS services and your applications. Now, you can zoom into a shorter time period such as one minute or five minutes while viewing the metric graph at a longer interval.

Pan 則是維持一樣的粒度,但改變開始與結束的時間:

Once zoomed, you can also pan the metric graph across your selected interval, but at a zoomed detail level.


LINE 將內部的座位表由 Excel 改成 Web 界面...

LINE 將內部的座位表由 Excel 管理,改用 Web 界面了:「Excel管理の座席表をLeafletでWeb化した話」,這邊不確定是全球的 LINE,還是只有日本的 LINE...

如果跟日本人有過業務合作的話,就會知道他們對 Excel 的用法只能用


來形容啊... 所以看到 LINE 特地寫了一篇來說明他們開發內部系統的事情,覺得還蠻有趣的...

起因是今年四月換辦公室,所以就順便換系統,把本來用 Excel 管理的座位表改用 Web 管理 (然後用了 Leaflet 這個 JavaScript Library):


這是 Excel 版本的樣子:

這是新版本的樣子,UI 上有更多互動的界面可以操作:

然後文末提到了總務業務量減少,而且因此變更座位變自由了而大受好評 (大概是不會讓總務煩死,所以就可以更自由換來換去 XDDD):


翻了一下英文版的 blog,好像沒有提到這件事情?XDDD

南韓出手調查 Google 未經同意蒐集位置資訊的問題了

在「就算關掉 Google 的定位服務也還是會蒐集位置資訊...」這邊提到的蒐集問題,南韓出手調查了:「Regulators question Google over location data」。

Regulators in South Korea summoned Google (GOOGL, Tech30) representatives this week to question them about a report that claimed the company was collecting data from Android devices even when location services were disabled.


U.K. data protection officials are also looking into the matter.

美國與歐盟其他國家反而還沒看到消息... (不過美國有可能是以訴訟的方式進行)

Imgur 也漏資料了... (帳號與密碼)

Imgur 官方發佈公告說明他們發現資料洩漏了:「Notice of Data Breach」,資料的洩漏是發生在 2014 年,包括了帳號與密碼:

Early morning on November 24th, we confirmed that approximately 1.7 million Imgur user accounts were compromised in 2014. The compromised account information included only email addresses and passwords. Imgur has never asked for real names, addresses, phone numbers, or other personally-identifying information (“PII”), so the information that was compromised did NOT include such PII.

然後 2014 年用的是 SHA-256

We have always encrypted your password in our database, but it may have been cracked with brute force due to an older hashing algorithm (SHA-256) that was used at the time. We updated our algorithm to the new bcrypt algorithm last year.

以單台八張 GTX 1080hashcat 的速度來看 (出自「8x Nvidia GTX 1080 Hashcat Benchmarks」),是 23GH/z 左右:

Hashtype: SHA256

Speed.Dev.#1.: 2865.2 MH/s (96.18ms)
Speed.Dev.#2.: 2839.8 MH/s (96.65ms)
Speed.Dev.#3.: 2879.5 MH/s (97.14ms)
Speed.Dev.#4.: 2870.6 MH/s (96.32ms)
Speed.Dev.#5.: 2894.2 MH/s (96.64ms)
Speed.Dev.#6.: 2857.7 MH/s (96.78ms)
Speed.Dev.#7.: 2899.3 MH/s (96.46ms)
Speed.Dev.#8.: 2905.7 MH/s (96.26ms)
Speed.Dev.#*.: 23012.1 MH/s

這對於鍵盤可以打出的所有字元來計算 (95 chars),八個字的密碼只要 3.33 天就可以跑完;如果只考慮英文數字 (62 chars),九個字的密碼只要 6.81 天。

這些還不是最新的 GPU,而且是單機計算,對於現在的攻擊應該會用 ASIC,可以考慮多三到四個數量級的數度在算 (看財力才會知道買多少機器)。

不過 Imgur 的帳號主要是參與討論 (因為不用帳號密碼也可以上傳圖片),一般比較不會在上面註冊... 真的有註冊的因為沒有其他個資,主要是怕共用密碼的問題。如果有用 password manager 應該也還好。

Amazon 西雅圖辦公室拿隔壁棟 Data Center 的廢熱當空調

Amazon 的其中一個辦公室拿隔壁 data center 的廢熱借來當自己辦公室的空調:「Amazon to use data centre waste heat to warm corporate offices」,原始報導在「The super-efficient heat source hidden below Amazon's Seattle headquarters」。除了嘗試省電省成本以外,對企業形象也比較好...

隔壁 Westin Building Exchange 的地址是「2001 6th Ave #300, Seattle, WA 98121」,辦公室則是在「2040 6th Ave, Seattle, WA 98121」,無論是從地址上看,或是 Google Maps 上可以看,都可以看出來兩棟就在旁邊而已,拉管線就簡單很多了。

預定二十五年省 80M 度電,所以一年大約是 3.2M 度,以「Seattle, WA Electricity Rates | Electricity Local」這邊給的數字來算,商業用店每度是 USD$0.068,每年大約省下 USD$217,600 (所以每年大約可以省下台幣六百萬),以 3800 人的辦公室來說其實有點微妙,不過以 PR 的角度還看其實就很划算了 XDDD:

It is expected, over the course of 25 years, to save approximately 80 million kWh of electricity use by Amazon.


各家 Session Replay 服務對個資的處理

Session Replay 指的是重播將使用者的行為錄下來重播,市面上有很多這樣的服務,像是 User Replay 或是 SessionCam

這篇文章就是在討論這些服務在處理個資時的方式,像是信用卡卡號的內容,或是密碼的內容,這些不應該被記錄下來的資料是怎麼被處理的:「No boundaries: Exfiltration of personal data by session-replay scripts」,主要的重點在這張圖:

後面有提到目前防禦的情況,看起來目前用 adblock 類的軟體可以擋掉一些服務,但不是全部的都在列表裡。而 DNT 則是裝飾品沒人鳥過:

Two commonly used ad-blocking lists EasyList and EasyPrivacy do not block FullStory, Smartlook, or UserReplay scripts. EasyPrivacy has filter rules that block Yandex, Hotjar, ClickTale and SessionCam.

At least one of the five companies we studied (UserReplay) allows publishers to disable data collection from users who have Do Not Track (DNT) set in their browsers. We scanned the configuration settings of the Alexa top 1 million publishers using UserReplay on their homepages, and found that none of them chose to honor the DNT signal.

Improving user experience is a critical task for publishers. However it shouldn’t come at the expense of user privacy.

Google Cloud Platform 的 DLP API

在「New ways to manage sensitive data with the Data Loss Prevention API」這邊提到三月的時候就推出了 DLP API (在「Discover and redact sensitive data with the Data Loss Prevention API」這邊提到的),不過沒什麼印象:

The Data Loss Prevention (DLP) API, which went beta in March, can help you quickly find and protect over 50 types of sensitive data such as credit card numbers, names and national ID numbers.



以色列黑了 Kaspersky 的系統,然後通報美國機密資料外洩...

前幾天在「俄羅斯政府透過卡巴斯基的漏洞,偷取美國國安局的文件」這邊提到了俄羅斯是透過 Kaspersky 的漏洞取得,後續又有些消息揭露出來了...

這件事情會被抓包,是因為以色列黑進去 Kaspersky 的系統 (???),然後發現美國的機密資料外洩 (??????),於是通報盟友美國後追查出來的 (?????????):「Israel hacked Kaspersky, then tipped the NSA that its tools had been breached」。

這過程是殺小 XDDD

分析 FCC 對網路中立性的留言,將鄉民與機器人分開來分析

Boing Boing 的立場其實還蠻鮮明的,所以有時候他們的新聞看看就好...

但這篇真的很有趣,把 FCC (美國的聯邦通信委員會) 上兩千兩百萬則對網路中立性的留言拿出來分析,結果發現真人與機器人的差異超明顯 XDDD:「Analysis of 22 million FCC comments show that humans love Net Neutrality and bots really, really hate it」,引用的文章在「Discovering truth through lies on the internet - FCC comments analyzed」這邊。

分析可以發現,真人偏好網路中立,而機器人反對網路中立 XDDD:(其實大家心裡都有底是怎麼玩出來的... 只是這次有機會分析,讓事情更明顯)

Data analysis company Gravwell ingested 22,000,000 comments sent to the FCC's docket on Net Neutrality and posted their preliminary findings, which are that the majority of comments came from bots, and these bots oppose Net Neutrality; of the comments that appear to originate with humans, the vast majority favor Net Neutrality.

文章中列出幾個有趣的現象,像是機器人的 comment 大量重複:

A very small minority of comments are unique -- only 17.4% of the 22,152,276 total. The highest occurrence of a single comment was over 1 million.

甚至是 pornhub.com 的郵件位置 XDDD:

Most comments were submitted in bulk and many come in batches with obviously incorrect information -- over 1,000,000 comments in July claimed to have a pornhub.com email address

然後機器人的 pattern 也很容易辨別:

Bot herders can be observed launching the bots -- there are submissions from people living in the state of "{STATE}" that happen minutes before a large number of comment submissions

這個有點類似「50 Cent Party (五毛黨)」,但是是自動化機器人,而且產出的「品質」不太好 XDDD