抓 PDF 裡文字的問題

Hacker News Daily 上看到的,在講從 PDF 裡面拉文字出來遇到的各種問題:「What's so hard about PDF text extraction?」。

FilingDB 是一家處理歐洲公司資料的公司,可能是開公司時送件的時候要求用 PDF,或是政府單位輸出的時候用 PDF,所以他們必須從這些 PDF 裡面拉出文字分析,然後就能夠讓程式使用:

會這麼難搞的原因是因為 PDF 是設計給輸出端用,而不是語意化用的格式:

The main problem is that PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document.

每個字元 (character) 都是可以被獨立控制的物件:

At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs - or even words - but as characters which are painted at certain locations on the page.

然後文章後面都在展示各種 workaround XD