Analyzing Websites with PublicWWW

I ran across the PublicWWW website service in "Keylogger Found on Nearly 5,500 Infected WordPress Sites".

The article itself is about WordPress sites being infected, but what caught my attention instead was PublicWWW, the site it mentions.

PublicWWW currently indexes the source of about 200 million websites, and some of what you can do with it is pretty nice, like searching for sites that use the same Google Analytics account:

Sites with the same analytics id: "UA-19778070-"

This is fantastic for finding out who is behind content farms, and it can also be used to grow content-farm blocklists, like "UA-31425034 - 19 Websites - PublicWWW.com" XD

The free tier only searches the top 3M sites; the paid plan (USD$49/month) searches the entire dataset.

Survival Rate of the Sites on the Million Dollar Homepage

The author analyzes what proportion of the sites listed on the Million Dollar Homepage, the big sensation of 2005, are still alive: "A Million Squandered: The “Million Dollar Homepage” as a Decaying Digital Artifact".

Here is a screenshot the author took in 2017:

And here is the analysis chart:

The author ran a script over the links and found that roughly half of the pixels now point at dead links:

The 547 unreachable links are attached to graphical elements that collectively take up 342,000 pixels (face value: $342,000). Redirects account for a further 145,000 pixels (face value: $145,000).

Counting by number of links instead, about 63% are still alive:

Of the 2,816 links that embedded on the page (accounting for a total of 999,400 pixels), 547 are entirely unreachable at this time. A further 489 redirect to a different domain or to a domain resale portal, leaving 1,780 reachable links.
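
The post doesn't include the author's crawler, but a rough version of the classification is easy to sketch in Perl with LWP. The classify() helper and the three buckets below are my own illustration, not the author's code:

#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;

# Classify a link the way the survey does: redirected elsewhere, still
# reachable in place, or dead. Redirects are not followed so that a 3xx
# can be counted separately.
my $ua = LWP::UserAgent->new( timeout => 10, max_redirect => 0 );

sub classify {
    my ($url) = @_;
    my $res = $ua->get($url);
    return 'redirect'  if $res->is_redirect;    # 3xx: moved or domain resale
    return 'reachable' if $res->is_success;     # 2xx: still alive
    return 'unreachable';                       # timeouts, DNS failures, 4xx/5xx
}

print classify('http://example.com/'), "\n";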

It suddenly feels like "another page has turned in the history of the galaxy" XDDD

Facebook Worked with the Google Chrome and Firefox Teams to Cut the Resources Used by Reloads

Facebook has spent quite a bit of time dealing with reloads: "This browser tweak saved 60% of requests to Facebook".

Facebook's engineers noticed that a huge share of requests for static resources came back as 304 (Not Modified) responses:

In 2014 we found that 60% of requests for static resources resulted in a 304. Since content addressed URLs never change, this means there was an opportunity to optimize away 60% of static resource requests.
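
As a refresher on what those conditional requests look like, here's a minimal LWP sketch (the content-addressed URL is a made-up placeholder): the client replays the validator from an earlier response, and a 304 means "your cached copy is still good", which saves the body but still costs a full round trip:

use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'https://example.com/static/app-3f9a2c1d.js';   # hypothetical content-addressed URL

# First fetch: remember the validator the server handed back.
my $first = $ua->get($url);
my $etag  = $first->header('ETag') // '';

# Revalidation: replay the validator; an unchanged resource comes back
# as a bodyless 304 instead of a full 200.
my $again = $ua->get($url, 'If-None-Match' => $etag);
print $again->code, "\n";    # 304 if not modified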

Google Chrome stood out as noticeably higher:

Tracking down the cause, they found that Chrome revalidates every resource on any page loaded as the result of a POST request:

A piece of code in Chrome hinted at the answer to our question. This line of code listed a few reasons, including reload, for why Chrome might ask to revalidate resources on a page. For example, we found that Chrome would revalidate all resources on pages that were loaded from making a POST request.

After some discussion, the behavior was deemed unnecessary and removed, and the drop is dramatic:

We worked with Chrome product managers and engineers and determined that this behavior was unique to Chrome and unnecessary. After fixing this, Chrome went from having 63% of its requests being conditional to 24% of them being conditional.

Chrome was still noticeably higher than the other browsers, though. Chasing the problem further, they found that when you navigate to the same URL (e.g. pressing Ctrl-L or Cmd-L and then Enter), Chrome treats it as a reload:

The fact that the percentage of conditional requests from Chrome was still higher than other browsers seemed to indicate that we still had some opportunity here. We started looking into reloads and discovered that Chrome was treating same URL navigations as reloads while other browsers weren't.

That fix didn't move the metrics much, however (testing in production XDDD):

Chrome fixed the same URL behavior, but we didn't see a huge metric change. We began to discuss changing the behavior of the reload button with the Chrome team.

The eventual change targeted the reload button itself: Facebook proposed that resources with a long max-age skip revalidation while short-lived ones keep the old behavior, and Chrome ended up applying the change to all cached resources. Something of a workaround:

There was some debate about what to do, and we proposed a compromise where resources with a long max-age would never get revalidated, but that for resources with a shorter max-age the old behavior would apply. The Chrome team thought about this and decided to apply the change for all cached resources, not just the long-lived ones.

Google also published a post describing the new behavior: "Reload, reloaded: faster and leaner page reloads".

When Facebook's engineers approached the Firefox team, Firefox decided to let sites explicitly declare what doesn't need revalidation on reload, instead of a workaround like Chrome's:

Firefox chose to implement this directive in the form of a cache-control: immutable header.

The Firefox team also wrote "Using Immutable Caching To Speed Up The Web" to explain the new feature.
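
For concreteness, here's a minimal PSGI sketch (my own example, not from either post; run it with plackup if Plack is installed) of serving a content-addressed asset with the header Firefox honors:

my $app = sub {
    my $env = shift;
    return [
        200,
        [
            'Content-Type'  => 'application/javascript',
            # A year of freshness plus immutable: browsers that support the
            # directive (Firefox here) skip revalidation even on reload.
            'Cache-Control' => 'max-age=31536000, immutable',
        ],
        [ "console.log('hello');\n" ],
    ];
};
$app;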

So there's one more thing to factor in the next time we plan out frontend and backend architecture...

Google's www.google.com Is Getting HSTS

Google announced that www.google.com is getting HSTS: "Bringing HSTS to www.google.com".

I already run the EFF's HTTPS Everywhere to make sure I'm on HTTPS, but this is still an important step, because it protects ordinary users...

Google's rollout plan is to increase the HSTS max-age gradually, limiting the impact if anything goes wrong along the way. It starts at just one day and will be raised step by step until it reaches a year:

In the immediate term, we’re focused on increasing the duration that the header is active (‘max-age’). We've initially set the header’s max-age to one day; the short duration helps mitigate the risk of any potential problems with this roll-out. By increasing the max-age, however, we reduce the likelihood that an initial request to www.google.com happens over HTTP. Over the next few months, we will ramp up the max-age of the header to at least one year.
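
A quick way to watch the ramp-up is to check the header yourself; a minimal LWP sketch (assumes LWP::Protocol::https is installed):

#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;

# Print the HSTS header from www.google.com; the max-age value shows
# how far along the rollout is.
my $ua  = LWP::UserAgent->new;
my $res = $ua->get('https://www.google.com/');
print $res->header('Strict-Transport-Security') // 'no HSTS header', "\n";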

CloudFlare Releases an nginx Patch Supporting Both HTTP/2 and SPDY

CloudFlare has released an nginx patch that supports HTTP/2 and SPDY at the same time: "Open sourcing our NGINX HTTP/2 + SPDY code".

The patch targets a somewhat old version, though:

We've extracted our changes and they are available as a patch here. This patch should build cleanly against NGINX 1.9.7.

As the comments under the post point out, nginx 1.9.7 is too old (released 2015/11/17); there have been a number of security updates and HTTP/2 bugfixes since then. We'll probably have to wait for an updated patch against a newer release before this is really usable.

sniffly: Using HSTS Information to Uncover Browsing History

I came across "sniffly", a tool that uses HSTS information to detect which sites you've visited; the code is available at "diracdeltas/sniffly":

Sniffly is an attack that abuses HTTP Strict Transport Security and Content Security Policy to allow arbitrary websites to sniff a user's browsing history. It has been tested in Firefox and Chrome.

There's also a demo site; the author seeded it by scanning domains from the Alexa rankings, so most popular sites should be covered...

At its core it's a timing attack combining HSTS with a CSP policy: if you've visited a site before, the browser already holds its HSTS entry and the redirect is fast; if you haven't, the redirect requires an actual network connection and is noticeably slower:

Sniffly sets a CSP policy that restricts images to HTTP, so image sources are blocked before they are redirected to HTTPS. This is crucial! If the browser completes a request to the HTTPS site, then it will receive the HSTS pin, and the attack will no longer work when the user visits Sniffly.

When an image gets blocked by CSP, its onerror handler is called. In this case, the onerror handler does some fancy tricks to time how long it took for the image to be redirected from HTTP to HTTPS. If this time is on the order of a millisecond, it was an HSTS redirect (no network request was made), which means the user has visited the image's domain before. If it's on the order of 100 milliseconds, then a network request probably occurred, meaning that the user hasn't visited the image's domain.

Because of how the trick works, HTTPS Everywhere has to be turned off for the results to be accurate.

nginx 1.9.6 Released

The link is right on the nginx homepage; clicking through to CHANGES shows two HTTP/2 fixes:

    *) Bugfix: a segmentation fault might occur in a worker process when
       using HTTP/2.
       Thanks to Piotr Sikora and Denis Andzakovic.

    *) Bugfix: the $server_protocol variable was empty when using HTTP/2.

    *) Bugfix: backend SSL connections in the stream module might be timed
       out unexpectedly.

    *) Bugfix: a segmentation fault might occur in a worker process if
       different ssl_session_cache settings were used in different virtual
       servers.

    *) Bugfix: nginx/Windows could not be built with MinGW gcc; the bug had
       appeared in 1.9.4.
       Thanks to Kouhei Sutou.

    *) Bugfix: time was not updated when the timer_resolution directive was
       used on Windows.

    *) Miscellaneous minor fixes and improvements.
       Thanks to Markus Linnala, Kurtis Nusbaum and Piotr Sikora.

Surprisingly few HTTP/2 bugfixes (after all, 1.9.5 was the first official release with it); it looks like the codebase has already settled down? By the way, is NGINX Mainline not going to be updated...

A Few Good Habits for Using LWP (WWW::Mechanize)...

WWW::Mechanize is a subclass of LWP::UserAgent from libwww-perl, so these techniques work with both...

First, send an Accept-Encoding: gzip header when fetching. You can set it once when creating the object, and it will be sent with every request afterwards:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->default_header('Accept-Encoding' => 'gzip');

After fetching, let HTTP::Response's decoded_content handle it for you:

my $res = $ua->get($uri);
my $msg = $res->decoded_content;    # gunzips and decodes the charset for you

The $msg you get back is in Perl's internal encoding; to turn it into UTF-8 you still need Encode's encode('utf8', $msg).
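
That conversion is a one-liner with the core Encode module:

use Encode;
my $bytes = encode('utf8', $msg);    # Perl internal string -> UTF-8 octets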

With this in place, whenever the server side supports gzip the content is compressed over the wire and decompressed locally; if it isn't compressed that's fine too, as HTTP::Response will handle it according to the Content-Type.

Update: clkao points out that WWW::Mechanize includes the gzip setting by default; I just tested against code.jquery.com and it checks out.