iPhone 換電池恢復效能的事情傳到 Geekbench 後...

在「iPhone 的電池與效能」這篇提到了 iPhone 換電池可以恢復效能,結果 Geekbench (也就是原來在 Reddit 上抱怨的人用的測速軟體) 的 John Poole 從 Geekbench 的回報資料庫裡分析了資料,發現了特別的現象後寫下這篇文章 (於是後來引發一連串報導,以及 Apple 的 PR 事件):「iPhone Performance and Battery Age」。

他先拿 iPhone 6S 分析,這看起來就不太妙:

再拿 iPhone 7 的資料分析,就更確定不妙:

可以看到 iOS 的 10.2.1 與 11.2.0 有奇怪的效能集中區。

後續蘋果也確認會刻意降速:「Apple addresses why people are saying their iPhones with older batteries are running ‘slower’」。

然後最新的發展就不太意外了,開始要打架了:「Days after iPhone battery fiasco, lawsuits against Apple begin to mount」。


reddit 與 4chan 在新聞網路上的獨特性

在「Study finds fringe communities on Reddit and 4chan have high influence on flow of alternative news to Twitter」這邊看到的:

After analyzing millions of posts containing mainstream and alternative news shared on Twitter, Reddit and 4chan, Jeremy Blackburn, Ph.D., and collaborators found that alt-right communities within 4chan, an image-based discussion forum where users are anonymous, and Reddit, a social news aggregator where users vote up or down on posts, have a surprisingly large influence on Twitter.

依照對 reddit4chan 的描述,這兩個媒體對 Twitter 的影響,會讓我聯想到在台灣 Ptt 對各新聞媒體的影響:Ptt 是很多新聞的起點?

"Based on our findings, these smaller, fringe communities on Reddit and 4chan serve as an incubation chamber for a lot of information," said Blackburn, assistant professor of computer science in the UAB College of Arts and Sciences. "The content and talking points are refined until they finally break free and make it to larger, more mainstream communities."

真的研究應該可以看出 Ptt 的影響力?

Reddit 的 Deploy 機制 (的歷史)

Reddit 主要是用 Python 寫的,這邊介紹了他們歷年來的 Code Deploy 系統:「The Evolution of Code Deploys at Reddit」。

最早期的時候 (2007 到 2010) 是用 rsync 更新程式碼,然後跑個迴圈用 ssh 連進去重跑:

# build the static files and put them on the static server
`make -C /home/reddit/reddit static`
`rsync /home/reddit/reddit/static public:/var/www/`

# iterate through the app servers and update their copy
# of the code, restarting once done.
foreach $h (@hostlist) {
    `git push $h:/home/reddit/reddit master`
    `ssh $h make -C /home/reddit/reddit`
    `ssh $h /bin/restart-reddit.sh`

2011 時因為人變多了,用 IRC 把過程丟出來 (okay,我知道你想問的問題... Slack 是 2013 年推出的):

The process for actually doing the deploy looked the same, but now the system did the work for you and told everyone what you were doing.

另外值得一提的是,因為他們不是自己架 IRC server 而是用外面第三方的伺服器,所以他們決定 IRC 只有單向告知的功能:

There was a lot of talk of systems that managed deploys from chat around this time, but since we used third party IRC servers we weren’t able to fully trust the chat room with production control and so it remained a one-way flow of information.

2012 時則是把機器列表放到 DNS 上,某種 service discovery 系統:

First, it fetched its list of hosts from DNS rather than keeping it hard-coded. This allowed us to update the list of hosts without having to remember to update the deploy tool as well — a rudimentary service discovery system.

另外是固定的版本,而非拉 master 下來,這樣可以避免 race condition 的不一致性 (推到一半有人把 code 塞進 master):

Another small but important change was to always deploy a fixed version of the code. The previous version of the tool would update master on a given host, but what if master changed mid-deploy because someone accidentally pushed up code? By deploying a specific git revision instead of branch name, we ensured that the deploy got the same version everywhere in production.

2013 往雲上搬,於是遇到像是「開新機器時剛好在 deploy 會拉到舊的 code」這種 edge case。

What happens if a server is launched while a deploy is ongoing? We had to make sure each newly launched server checked in to get new code if present. What about servers going away mid-deploy? The tool had to be made smarter to detect when the server was gone legitimately rather than there being an issue with the deploy process itself that should be noisily alerted on.

2014 遇到機器數量太多,推一輪要一個小時而被迫要平行化處理:

Over time, the number of servers needed to serve peak traffic grew. This meant that deploys took longer and longer. At its worst, a normal deploy took close to an hour. This was not good.

2015 則是加上 deploy lock,避免同時間有兩個人在 deploy:

Engineers would ask for the deploy lock and either get it or get put in the queue. This helped keep order in deploys and let people relax a bit while waiting for the lock.

2017 的部份則是提到了伺服器的數量:

This new mechanism allows us to deploy to a lot more machines concurrently, and deploy timings are down to 7 minutes for around 800 servers despite the extra waiting for safety.

看起來到現在還是維持手動 deploy,而不是自動化... 這塊還蠻有趣的 :o

Reddit 在處理 Page View 的方式

Reddit 說明了他們如何處理 pageview:「View Counting at Reddit」。

以 Reddit 的規模有提到兩個重點,第一個在善用 RedisHyperLogLog 這個資料結構,當量大的時候其實可以允許有微小的誤差:

The amount of memory varies per implementation, but in the case of this implementation, we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!

維基百科上有說明當資料量在 109 這個等級時,用 1.5KB 的記憶體只有 2% 的誤差值:

The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical error rate of 2%, using 1.5 kB of memory.

第二個則是寫入允許短時間的誤差 (pageview 不會即時反應),透過批次處理降低對 Cassandra cluster 的負荷:

Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster.

可以注意到把 Redis 當作 cache 層而非 storage 層。

主要原因應該跟 Redis 定位是 data structure server 而非 data structure storage 有關 (可以從對 Durability 的作法看出來),而使用 Cassandra 存 key-value 非常容易 scale,但讀取很慢。剛好兩個相輔相成。

Reddit 在規劃要禁止「阻擋 Adblock 的網站」

如標題所說的,Reddit 在規劃這些阻擋 Adblock 的網站:「Mod Announcement: We're considering banning all domains that require users to disable ad blockers and we'd like your input」。

這些網站要求使用者將網站列入 Adblock 白名單,然後這些網站就會「不小心」推送 malware 給使用者:

It has come to our attention that many websites such as Forbes and Wired are now requiring users to disable ad blockers to view content. Because Forbes requires users to do this and has then served malware to them we see this as a security risk to you our community. There are also sites such as Wall Street Journal that have implemented pay-walls which we were are also considering banning.



在「Kudos - A Peer-to-Peer Discussion System Based on Social Voting」這邊看到分散式的論壇系統,帶有投票分數機制以及相關議題機制:

Decentralized Reddit using a DHT to store content and a blockchain to rank such content. Whitepaper with more details here: http://lucaa.org/docs/kudos.pdf

論文裡面可以看出來設計的觀念受到 Bitcoin 的啟發,演算法也是... 換句話說,Bitcoin 帶來的影響遠遠超過金融市場,Bitcoin 所使用的理論也給其他領域很多想法。

如果這樣的系統可行的話 (還沒仔細研究 @_@),真正分散式的論壇系統就會出現了...

用 Google BigQuery 分析 Reddit 釋出的資料

前幾天有提到 Reddit 官方放出全站的投稿以及 comment 資訊:「Reddit 放出完整的全站投稿資料」,馬上就有人拿工具來分析了:「How to Analyze Every Reddit Submission and Comment, in Seconds, for Free」。

這篇的作者是用 GoogleBigQuery 分析,而 BigQuery 跟 SQL 操作方法類似,所以我猜用 Amazon Redshift 或是 Apache Spark 應該都可以做到類似的事情吧,就看對工具的熟悉度。圖片則是透過 BigQuery 產生 csv 擋,再透過 Rggplot2 產生出來。

作者給的每張圖都有提供對應的 SQL-like 查詢語法,每張圖的意義也都直接把說明寫上去了,結果還蠻有趣的:

Reddit 放出完整的全站投稿資料

前幾天 Reddit 宣佈放出完整的全站投稿資料:「Full Reddit Submission Corpus now available (2006 thru August 2015)」,有些技術問題使得這次沒放出 2006 與 2007 的資料,之後會想辦法補上:

Data is complete from January 01, 2008 thru August 31, 2015. Partial data is available for years 2006 and 2007. The reason for this is that the id's used when Reddit was just a baby were scattered a bit -- but I am making an attempt to grab all data from 2006 and 2007 and will make a supplementary upload for that data once I'm satisfied that I've found all data that is available.

約 42GB 的資料,幾乎是公開的資料都包含進去了:

This dataset represents approximately 200 million submission objects with score data, author, title, self_text, media tags and all other attributes available via the Reddit API.

檔案放在 Amazon S3 上,不過有人貼出對應的 BitTorrent 連結了,最重要的 btih 值是 9941b4485203c7838c3e688189dc069b7af59f2e。