字母在單字裡的位置分佈

是一篇老文章了... (2014 年的文章,最近從其他地方提起)

這邊講的是英文,不過同樣方式也可以拿來分析其他語言:「The distribution of letters in English words」,原始文章在「Graphing the distribution of English letters towards the beginning, middle or end of words」。

原文有描述他的資料分析來源:

The data is from the entire Brown corpus in the Natural Language Toolkit. It's a smaller and out-of-date corpus, but it's open source and easy to obtain. I repeated the analysis with COHA, the Corpus of Historical American English, a well-curated, proprietary data set from Brigham Young University for which I have a license, and the only differences were in rare letters like "z" or "x".

利用 Unicode Domain 釣魚,以及 Chrome 與 Firefox 的解法

一個多禮拜前引起蠻多討論的一篇文章,利用 Unicode Domain 釣魚的方法:「Phishing with Unicode Domains」。

由於這是幾乎完美的攻擊,所以被提出來後 (Security: Whole-script confusable domain label spoofing) 有不少討論:

This bug was reported to Chrome and Firefox on January 20, 2017 and was fixed in the Chrome trunk on March 24. The fix is included in Chrome 58 which is currently rolling out to users.

comment 8 提到:

We do have a whitelist. Essentially you're suggesting that we remove Cyrillic and Greek characters from the list. I'm not sure we want to go down that path.

在新版的 Chrome 58 已經「修正」了這個問題:

Firefox 的討論在「IDN Phishing using whole-script confusables on Windows and Linux」這邊,一開始就直接把票給關了 XDDD:

Indeed. Our IDN threat model specifically excludes whole-script homographs, because they can't be detected programmatically and our "TLD whitelist" approach didn't scale in the face of a large number of new TLDs. If you are buying a domain in a registry which does not have proper anti-spoofing protections (like .com), it is sadly the responsibility of domain owners to check for whole-script homographs and register them.

We can't go blacklisting standard Cyrillic letters.

If you think there is a problem here, complain to the .com registry who let you register https://www.xn--80ak6aa92e.com/ .

Gerv

Status: NEW → RESOLVED
Last Resolved: 3 months ago
Flags: needinfo?(gerv)
Resolution: --- → WONTFIX

然後一個月前被提出來看看 Chrome 怎麼做:

Gerv/Valentin, is this something we can/should align with Chromium on?

目前唯一的解法是改 flag,把所有的 Unicode Domain 直接當作一般的 domain 來處理,列出像是 www.xn--80ak6aa92e.com 的網址。

紐約公共圖書館放出十八萬張數位高畫質的數位資料

紐約公共圖書館這次放出了十八萬張數位資料,包括歷史照片、地圖以及信件:「The New York Public Library Lets You Download 180,000 Images in High Resolution: Historic Photographs, Maps, Letters & More」,圖書館官方的公告在「Free for All: NYPL Enhances Public Domain Collections For Sharing and Reuse」這邊:

The release of more than 180,000 digitized items represents both a simplification and an enhancement of digital access to a trove of unique and rare materials: a removal of administration fees and processes from public domain content, and also improvements to interfaces — popular and technical — to the digital assets themselves.

除了可以在「NYPL Digital Collections」這邊搜尋下載外,還有 API 可以用:「The New York Public Library Digital Collections API」,在 GitHub 上也有工具可以使用:「Digital Collections Public Domain Item Data and Tools」。

而且這 18 萬張資料是完全的開放,不需要事先取得館方授權:

No permission required, no hoops to jump through: just go forth and reuse!

將 public domain 的文物數位化,傳遞與保存變的更便利... (也讓做研究的人更容易取得資料)