Facebook's New Algorithm for Handling Misspellings

Facebook released fastText a while back, and at the beginning of this month they announced another algorithm, Misspelling Oblivious Embeddings (MOE), which builds on fastText and improves it: "A new model for word embeddings that are resilient to misspellings".

Facebook's announcement notes that on user-generated text, MOE performs better than fastText:

We checked the effectiveness of this approach considering different intrinsic and extrinsic tasks, and found that MOE outperforms fastText for user-generated text.

The paper is on arXiv: "Misspelling Oblivious Word Embeddings".

According to the description, fastText centers on a semantic loss, while MOE adds a spell correction loss on top of it:

The loss function of fastText aims to more closely embed words that occur in the same context. We call this semantic loss. In addition to the semantic loss, MOE also considers an additional supervised loss that we call spell correction loss. The spell correction loss aims to embed misspellings close to their correct versions by minimizing the weighted sum of semantic loss and spell correction loss.
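In my own notation (the paper defines the exact symbols and weighting scheme), the objective described above is a weighted sum of the two losses, with a hyperparameter α trading off semantics against spell correction:

L_MOE = α · L_FT + (1 − α) · L_SC

where L_FT is fastText's semantic loss and L_SC is the spell correction loss.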

However, the facebookresearch/moe repository on GitHub currently only holds the dataset; the code itself hasn't been open sourced for direct use, so you may have to implement it yourself...

Word2Vec: Guessing the Meaning of Other Words via Vectors

Back in 2013 I saw "Automatic Translation Without Dictionaries", about self-learning approaches to machine translation; it pointed to the article "How Google Converted Language Translation Into a Problem of Vector Space Mathematics", and the paper behind that is "Exploiting Similarities among Languages for Machine Translation", which Google published on arXiv.

Recently I came across "The Illustrated Word2vec" and pulled up those notes from more than five years ago to cross-reference... This equation gives everyone the basic idea: explaining the meaning of words through a formula:
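Presumably this is the classic analogy that "The Illustrated Word2vec" walks through, where arithmetic on word vectors recovers a related word:

vec(king) − vec(man) + vec(woman) ≈ vec(queen)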

Once you can extract relationships like this, there is a chance to learn new vocabulary... and then apply it to translation into other languages. A sketch of how such an analogy is resolved follows below.
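As a toy illustration (a minimal sketch with made-up 3-dimensional vectors, not real embedding data), the computation is just vector arithmetic followed by a nearest-neighbor search using cosine similarity:

import java.util.HashMap;
import java.util.Map;

public class AnalogyDemo {
    // Cosine similarity between two equal-length vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings", made up purely for illustration.
        Map<String, double[]> emb = new HashMap<>();
        emb.put("king",  new double[] {0.9, 0.8, 0.1});
        emb.put("man",   new double[] {0.5, 0.1, 0.1});
        emb.put("woman", new double[] {0.5, 0.1, 0.9});
        emb.put("queen", new double[] {0.9, 0.8, 0.9});
        emb.put("apple", new double[] {0.1, 0.9, 0.4});

        // Compute king - man + woman.
        double[] k = emb.get("king"), m = emb.get("man"), w = emb.get("woman");
        double[] target = new double[k.length];
        for (int i = 0; i < k.length; i++) {
            target[i] = k[i] - m[i] + w[i];
        }

        // Nearest neighbor by cosine similarity, excluding the query words.
        String best = null;
        double bestSim = -2;
        for (Map.Entry<String, double[]> e : emb.entrySet()) {
            String word = e.getKey();
            if (word.equals("king") || word.equals("man") || word.equals("woman")) {
                continue;
            }
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = word;
            }
        }
        System.out.println(best);  // prints "queen" with these toy vectors
    }
}

Real embeddings have hundreds of dimensions and vocabularies in the millions, so the linear scan here is usually replaced with an approximate nearest-neighbor index.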

The Distribution of Letters by Position Within Words

An old article... (from 2014; it came up again recently elsewhere)

It covers English, but the same approach can be applied to other languages: "The distribution of letters in English words"; the original article is "Graphing the distribution of English letters towards the beginning, middle or end of words".

The original article describes the source of the data used in the analysis:

The data is from the entire Brown corpus in the Natural Language Toolkit. It's a smaller and out-of-date corpus, but it's open source and easy to obtain. I repeated the analysis with COHA, the Corpus of Historical American English, a well-curated, proprietary data set from Brigham Young University for which I have a license, and the only differences were in rare letters like "z" or "x".
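The computation itself is easy to reproduce. A minimal sketch (assuming a hypothetical one-word-per-line file words.txt rather than the Brown corpus, and bucketing each letter's relative position into tenths of the word):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LetterPositions {
    public static void main(String[] args) throws IOException {
        final int BUCKETS = 10;
        // counts[letter][bucket]: how often each letter lands in each
        // relative-position bucket (0 = start of the word, 9 = end).
        long[][] counts = new long[26][BUCKETS];

        // Hypothetical input: a plain one-word-per-line file; the original
        // analysis used the Brown corpus from NLTK instead.
        List<String> words = Files.readAllLines(Paths.get("words.txt"));
        for (String word : words) {
            String w = word.trim().toLowerCase();
            int len = w.length();
            for (int i = 0; i < len; i++) {
                char c = w.charAt(i);
                if (c < 'a' || c > 'z') {
                    continue;  // skip digits, hyphens, etc.
                }
                // Map the relative position i/len in [0, 1) to a bucket.
                int bucket = (int) ((double) i / len * BUCKETS);
                counts[c - 'a'][bucket]++;
            }
        }

        // Print each letter's distribution as percentages per bucket.
        for (int l = 0; l < 26; l++) {
            long total = 0;
            for (long n : counts[l]) {
                total += n;
            }
            if (total == 0) {
                continue;
            }
            System.out.printf("%c:", (char) ('a' + l));
            for (int b = 0; b < BUCKETS; b++) {
                System.out.printf(" %5.1f%%", 100.0 * counts[l][b] / total);
            }
            System.out.println();
        }
    }
}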

Advocating for Plain Text

Saw an argument for plain text in "The future of education is plain text" XDDD

The author makes five points:

  • Plain text is always compatible
  • Plain text is easy to mix and match
  • Plain text is easy to maintain
  • Plain text is lightweight
  • Plain text is always forward compatible

Fair enough: put the effort into the content itself rather than into a pile of formatting...

This reminds me of the interview clip where George R. R. Martin mentioned on Conan that he writes A Song of Ice and Fire in WordStar 4.0:

印 "#" 比印 "B" 來的快的問題

This is a Stack Overflow question from two years ago: "Why is printing “B” dramatically slower than printing “#”?".

The asker's program ran in 8.52 seconds:

import java.util.Random;

public class Main {
    public static void main(String[] args) {
        Random r = new Random();
        for (int i = 0; i < 1000; i++) {
            for (int j = 0; j < 1000; j++) {
                // Print "O" about a quarter of the time, "#" otherwise.
                if (r.nextInt(4) == 0) {
                    System.out.print("O");
                } else {
                    System.out.print("#");
                }
            }
            System.out.println("");
        }
    }
}

Swapping the # above for B made it take 259.152 seconds.

The answer turns out to be related to word-wrapping:

Pure speculation is that you're using a terminal that attempts to do word-wrapping rather than character-wrapping, and treats B as a word character but # as a non-word character. So when it reaches the end of a line and searches for a place to break the line, it sees a # almost immediately and happily breaks there; whereas with the B, it has to keep searching for longer, and may have more text to wrap (which may be expensive on some terminals, e.g., outputting backspaces, then outputting spaces to overwrite the letters being wrapped).

But that's pure speculation.
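One way to test this speculation (my own sketch, not from the answer) is to time the same loop against an in-memory stream; if the terminal's word-wrapping is the expensive part, the gap between "#" and "B" should disappear once the terminal is out of the picture:

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Random;

public class WrapTest {
    // Run the original loop, printing `ch` to the given stream,
    // and return the elapsed time in milliseconds.
    static long run(PrintStream out, String ch) {
        Random r = new Random();
        long start = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            for (int j = 0; j < 1000; j++) {
                if (r.nextInt(4) == 0) {
                    out.print("O");
                } else {
                    out.print(ch);
                }
            }
            out.println("");
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // In-memory sink: no terminal involved, so no word-wrapping.
        PrintStream sink = new PrintStream(new ByteArrayOutputStream());
        long sinkHash = run(sink, "#");
        long sinkB = run(sink, "B");
        // Console runs: whatever the terminal does is included in the time.
        long consoleHash = run(System.out, "#");
        long consoleB = run(System.out, "B");
        System.out.println("# to memory:  " + sinkHash + " ms");
        System.out.println("B to memory:  " + sinkB + " ms");
        System.out.println("# to console: " + consoleHash + " ms");
        System.out.println("B to console: " + consoleB + " ms");
    }
}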

Now that's a subtle detail XDDD