抓 PDF 裡文字的問題

Hacker News Daily 上看到的,在講從 PDF 裡面拉文字出來遇到的各種問題:「What's so hard about PDF text extraction?」。

FilingDB 是一家處理歐洲公司資料的公司,可能是開公司時送件的時候要求用 PDF,或是政府單位輸出的時候用 PDF,所以他們必須從這些 PDF 裡面拉出文字分析,然後就能夠讓程式使用:

會這麼難搞的原因是因為 PDF 是設計給輸出端用,而不是語意化用的格式:

The main problem is that PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document.

每個字元 (character) 都是可以被獨立控制的物件:

At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs - or even words - but as characters which are painted at certain locations on the page.

然後文章後面都在展示各種 workaround XD

用 OpenCV 處理 webcam 的背景,再用 pyfakewebcam 接回給視訊軟體用...

最近武漢肺炎的關係,導致蠻多人會在家裡使用視訊會議,但背景一直是個問題... 然後就看到這篇文章:「Open Source Virtual Background」。

作者用 python-opencv 先處理 webcam 進來的影像 (看起來不只去背,還加上了 hologram 的效果 XDDD),然後再用 pyfakewebcamv4l2loopback 模擬成一個 webcam 餵回給視訊軟體,結果就惡搞成這樣了:

話說回來,最近各電商平台的 webcam 與視訊機都缺貨,還好之前有買個便宜的頂著,不然就得開筆電了...

Cloudflare 同時支援 TLS 1.2 與 TLS 1.3 的過程

Cloudflare 算是很早就參與 TLS 1.3 發展的廠商。在參與過程中他們希望讓支援 TLS 1.3 draft 的瀏覽器可以開始使用 TLS 1.3 draft,但又不希望因為 draft 頻繁修改而導致本來的使用者受到影響,所以就找了方法讓兩者並存:「Know your SCM_RIGHTS」。

這個方法就是 SCM_RIGHTS,可以讓另外一個 process 存取自己的 file description。

You can use UNIX-domain sockets to pass file descriptors between applications, and like everything else in UNIX connections are files.

所以他們的作法就是先讀取 TLS 裡 Client Hello 的資料,如果裡面有看到想要使用 TLS 1.3 的訊息,就透過前面提到的 SCM_RIGHTS 丟進 Golang 寫的程式跑:

We let OpenSSL read the “Client Hello” message from an established TCP connection. If the “Client Hello” indicated TLS version 1.3, we would use SCM_RIGHTS to send it to the Go process. The Go process would in turn try to parse the rest of the “Client Hello”, if it were successful it would proceed with TLS 1.3 connection, and upon failure it would give the file descriptor back to OpenSSL, to handle regularly.

這樣本來的 stack 就只要修改一小段程式碼,將當時還很頻繁修改的 TLS 1.3 draft 丟到另外一個 process 跑,就比較不用擔心本來的 stack 會有狀況了。

兩個 process 互相 kill 的可能性

Twitter 上看到的奇怪實驗 XD

Mac OS XUbuntu 上都有機會發生:

又一篇戰文:討論 TDD 的過程

最近在 39th International Conference on Software Engineering 上受邀參加說明的論文,在 The Morning Paper 上看到的:「A dissection of the test-driven development process: does it really matter to test-first or test-last?」。

論文本身在「A Dissection of the Test-Driven Development Process: Does It Really Matter to Test-First or to Test-Last?」這邊可以下載,是去年 2016 就發出來的論文。

論文拆解 TDD 的行為,分析到底是哪些階段才是有正面幫助的:

Background: Test-driven development (TDD) is a technique that repeats short coding cycles interleaved with testing. The developer first writes a unit test for the desired functionality, followed by the necessary production code, and refactors the code. Many empirical studies neglect unique process characteristics related to TDD iterative nature. Aim: We formulate four process characteristic: sequencing, granularity, uniformity, and refactoring effort. We investigate how these characteristics impact quality and productivity in TDD and related variations. Method: We analyzed 82 data points collected from 39 professionals, each capturing the process used while performing a specific development task. We built regression models to assess the impact of process characteristics on quality and productivity. Quality was measured by functional correctness.

比較特別的是作者指出 refactoring 的負面效果:

Result: Quality and productivity improvements were primarily positively associated with the granularity and uniformity. Sequencing, the order in which test and production code are written, had no important influence. Refactoring effort was negatively associated with both outcomes. We explain the unexpected negative correlation with quality by possible prevalence of mixed refactoring. Conclusion: The claimed benefits of TDD may not be due to its distinctive test-first dynamic, but rather due to the fact that TDD-like processes encourage fine-grained, steady steps that improve focus and flow.

在 The Morning Paper 上的註解也可以看看 (反駁論文的意見),像是對論文使用自評系統的批評。

Bash 裡處理 PID file 的方式...

看到「Age comparison in Bash for files and processes」後查了一些資料,如果在不使用外部程式處理的話,的確是多做了不少事情。

這是 Bash-ot 的說明:

file1 -ot file2
       True if file1 is older than file2, or if file2 exists and file1 does not.

而這是 test (也就是 [ 這隻程式) 對 -ot 的說明:

FILE1 -ot FILE2
       FILE1 is older than FILE2

多了檔案是否存在的檢查...

另外可以參考「What's the difference between [ and [[ in Bash?」這邊的說明,Bash 的 [[ 是 Bash 特有的加強版,而 [ 則是用系統的 test 運算。

加快 Ubuntu 上 Firefox 的速度...

新版的 Firefox 已經支援 Multi-processes 架構 (Electrolysis),但 Ubuntu 上會因為預設值的關係而被關閉,這篇文章就是講原因以及怎麼打開:「Enabling This Makes Firefox More Responsive On Ubuntu」。

由於大家都會裝一堆套件,看起來得用 Force Enable 這邊提到的方法打開,也就是手動在 about:config 內加入 browser.tabs.remote.force-enable,並設為 true

Firefox 的轉換還有很長一段時間要走...

下個版本 Firefox 的 Multi-Process 將預設全面開啟

Mozilla 說明了 Multi-Process 最近的一些進度:「Update on Multi-Process Firefox」。

就目前測試的資料看起來,速度大幅提昇:

Those users have been enjoying the 400% increase in responsiveness and a 700% improvement when web pages are loading.

現在的 Firefox 是 50 版,目前的情況是當 extension 標成 multi-process compatible 就會啟用:

With Firefox 49 we deployed multi-process Firefox to users with a select set of well tested extensions. Our measurements and user feedback were all positive and so with Firefox 50 we deployed multi-process Firefox to users with a broader set of extensions, those whose authors have marked them as multi-process compatible.

下一個版本 (51) 則是會全面開啟,除非有 extension 標成 multi-process incompatible:

Beyond Firefox 50, we have more work to do to enable multi-process Firefox for users with as yet unsupported extensions. In Firefox 51, if all testing goes according to plan, we’ll be enabling multi-process Firefox for users with extensions that are not explicitly marked as incompatible with multi-process Firefox.

總算要把整個架構換過去了...

Firefox 48 的 Electrolysis (e10s),多程序視窗

Firefox 的 Multi-process,也就是 Electrolysis,正式進入 Release 版了:「What’s Next for Multi-process Firefox」。

依照說明,會分成這幾個不同種類的 process:

Electrolysis child processes are currently in use for the following tasks within Firefox:

  • Legacy NPAPI plugin hosting
  • Media playback
  • Web content ('content processes')

Mozilla 給的這張圖則是 UI 與 Web Content 分離:

目前是設計成只有 1% 使用者會啟動,而如果你想要測試的話可以用 about:config 修改 browser.tabs.remote.autostart。不過只要有 extension 掛著,就不會啟動,像是這樣:

當我把 extension 全部關閉後:

所以短時間內應該也沒辦法測...

Firefox 48 將引入 Multi-process 模式

現在是 Firefox 47,也就是下個版本就會引入 Multi-process 模式了:「Firefox 48 finally enables Electrolysis for multi-process goodness」。

Multi-process 模式也就是 Electrolysis,這對權限的分離 (sandbox) 以及效能的提昇有著顯著的幫助:

Electrolysis functionality hosts, renders, or executes web related content in background child processes which communicate with the "parent" Firefox browser via various ipdl protocols. The two major advantages of this model are security and performance. Security improvements are accomplished through security sandboxing, performance improvements are born out of the fact that multiple processes better leverage available client computing power.

這算是個開頭,地基打出來以後,之後就可以拆更多東西到其他的 child process。