extraction

Saw this on Hacker News Daily: an article about the various problems you run into when pulling text out of PDFs, "What's so hard about PDF text extraction?".

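FilingDB is a company that processes data on European companies. Perhaps filings have to be submitted as PDFs when a company is registered, or government agencies publish their output as PDFs; either way, they have to extract the text from these PDFs and analyze it before programs can make use of it.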

The reason this is so painful is that PDF was designed as an output format, not as a semantic one:

The main problem is that PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document.

Every character is an object that can be positioned independently:

At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs - or even words - but as characters which are painted at certain locations on the page.
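To make "characters painted at certain locations" a bit more concrete, here is a minimal sketch of how text might be rebuilt from positioned glyphs. Everything in it is an assumption for illustration: the glyph shape (`char`, `x`, `y`, `width`), the function name `glyphsToLines`, and the two thresholds are made up, and this is not the method FilingDB describes. Glyphs whose `y` values are close get bucketed into the same line, and a space is inserted whenever the horizontal gap to the previous glyph is large.

```javascript
// Hypothetical input: one entry per painted glyph, with page coordinates.
// Real PDF libraries expose richer (and messier) data than this.
function glyphsToLines(glyphs, { lineTolerance = 2, gapFactor = 0.3 } = {}) {
  const lines = [];

  // Bucket glyphs into lines: a glyph joins the current line when its y
  // is within lineTolerance of the y of that line's first glyph.
  for (const g of [...glyphs].sort((a, b) => a.y - b.y)) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - g.y) <= lineTolerance) {
      last.glyphs.push(g);
    } else {
      lines.push({ y: g.y, glyphs: [g] });
    }
  }

  return lines.map(({ glyphs: row }) => {
    // Within a line, read left to right.
    row.sort((a, b) => a.x - b.x);
    let text = "";
    let prev = null;
    for (const g of row) {
      // A gap wider than a fraction of the previous glyph's width
      // is treated as a word break.
      if (prev && g.x - (prev.x + prev.width) > gapFactor * prev.width) {
        text += " ";
      }
      text += g.char;
      prev = g;
    }
    return text;
  });
}

// Glyphs arrive in arbitrary order, with slightly jittery baselines.
const glyphs = [
  { char: "i", x: 10, y: 100.4, width: 3 },
  { char: "H", x: 0, y: 100, width: 8 },
  { char: "P", x: 20, y: 100, width: 7 },
  { char: "D", x: 28, y: 100.2, width: 8 },
  { char: "F", x: 37, y: 100, width: 7 },
  { char: "o", x: 0, y: 115, width: 6 },
  { char: "k", x: 6.5, y: 115, width: 6 },
];

console.log(glyphsToLines(glyphs));
```

Even this toy version needs two hand-tuned thresholds, which hints at why real documents (multiple columns, rotated text, tables) are so much harder.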

The rest of the article then walks through all kinds of workarounds XD

Saw the post "Mercury Goes Open Source!": the team at Postlight has open-sourced Mercury Web Parser, and the code is available on GitHub at postlight/mercury-parser.

This version is written in Node.js; the example shows both the usage and the result:

import Mercury from '@postlight/mercury-parser';
// url is the address of the page to parse; parse() returns a promise
Mercury.parse(url).then(result => console.log(result));

{
  "title": "Thunder (mascot)",
  "content": "<div><div><p>This is the content of the page!</div></div>",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

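For earlier software and services, see the roundup and comparison in "Evaluating Text Extraction Algorithms", though even the original site for that article is gone... it can only be dug up from the Internet Archive.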

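Quite a few teams have tackled this problem (given an HTML page, pull out the block that holds the actual content), but quite a few of them have died too... The one I remember best is Readability, which shut down in 2016: "The Readability bookmarking service will shut down on September 30, 2016.".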

Handy to have when you need to scrape data...
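For a feel of what "given an HTML page, pull out the content block" involves, here is a deliberately crude sketch. It is not how Mercury or Readability work; the function name `pickMainBlock`, the list of block-level tags, and the scoring rule (text length minus twice the link text length) are all assumptions made up for illustration. The idea is simply that navigation and footers are mostly link text, while the article body is mostly plain text.

```javascript
// Remove tags and collapse whitespace to get the visible text of a fragment.
const stripTags = (s) => s.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();

function pickMainBlock(html) {
  // Crude assumption: cut the page wherever a common block-level tag
  // opens or closes, and treat each piece as a candidate block.
  const chunks = html.split(
    /<\/?(?:div|p|section|article|nav|header|footer|ul|ol|li)\b[^>]*>/i
  );

  let best = { text: "", score: -Infinity };
  for (const chunk of chunks) {
    const text = stripTags(chunk);
    if (!text) continue;

    // Count how many characters of the block sit inside <a> elements.
    const linkChars = [...chunk.matchAll(/<a\b[^>]*>([\s\S]*?)<\/a>/gi)]
      .reduce((n, m) => n + stripTags(m[1]).length, 0);

    // Long text is good, link-heavy text is penalized.
    const score = text.length - 2 * linkChars;
    if (score > best.score) best = { text, score };
  }
  return best.text;
}

const page =
  '<nav><a href="/">Home</a> <a href="/about">About us</a></nav>' +
  "<div><p>Thunder is the stage name for the horse who is the official mascot.</p></div>" +
  '<footer><a href="/terms">Terms of service</a></footer>';

console.log(pickMainBlock(page));
```

A regex-based split like this falls apart on nested or malformed markup, which is part of why dedicated parsers exist.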

Tag: extraction

The problem of extracting text from PDFs

Mercury Web Parser goes open source