The data is from the entire Brown corpus in the Natural Language Toolkit. It's a smaller and out-of-date corpus, but it's open source and easy to obtain. I repeated the analysis with COHA, the Corpus of Historical American English, a well-curated, proprietary data set from Brigham Young University for which I have a license, and the only differences were in rare letters like "z" or "x".
This bug was reported to Chrome and Firefox on January 20, 2017 and was fixed in the Chrome trunk on March 24. The fix is included in Chrome 58 which is currently rolling out to users.
We do have a whitelist. Essentially you're suggesting that we remove Cyrillic and Greek characters from the list. I'm not sure we want to go down that path.
Indeed. Our IDN threat model specifically excludes whole-script homographs, because they can't be detected programmatically and our "TLD whitelist" approach didn't scale in the face of a large number of new TLDs. If you are buying a domain in a registry which does not have proper anti-spoofing protections (like .com), it is sadly the responsibility of domain owners to check for whole-script homographs and register them.
We can't go blacklisting standard Cyrillic letters.
If you think there is a problem here, complain to the .com registry who let you register https://www.xn--80ak6aa92e.com/ .
Gerv
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Flags: needinfo?(gerv)
Resolution: --- → WONTFIX
然後一個月前被提出來看看 Chrome 怎麼做:
Gerv/Valentin, is this something we can/should align with Chromium on?
The release of more than 180,000 digitized items represents both a simplification and an enhancement of digital access to a trove of unique and rare materials: a removal of administration fees and processes from public domain content, and also improvements to interfaces — popular and technical — to the digital assets themselves.