Solving a math logic puzzle with code...

Spotted a math logic puzzle on Hacker News Daily: "Which answer in this list is the correct answer to this question?".

The puzzle goes like this:

Which answer in this list is the correct answer to this question?

  • All of the below.
  • None of the below.
  • All of the above.
  • One of the above.
  • None of the above.
  • None of the above.

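The accepted answer works it out by deduction, but the top-voted one brute-forces it with a program XDDD (gotta admit, everyone loves this kind of thing...)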
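For a feel of what the brute-force approach looks like, here is a minimal sketch in Python. It is not the code from the top-voted answer. It assumes each of the six answers is a statement that is either true or false, that "One of the above" means exactly one, and that a solution is an assignment where every statement's truth value matches what it claims. The names `consistent` and `solutions` are made up for this illustration.

```python
from itertools import product

def consistent(s):
    """s[i] is True when answer i+1 is taken to be correct."""
    claims = (
        all(s[1:]),        # 1. All of the below.
        not any(s[2:]),    # 2. None of the below.
        all(s[:2]),        # 3. All of the above.
        sum(s[:3]) == 1,   # 4. One of the above (assumed: exactly one).
        not any(s[:4]),    # 5. None of the above.
        not any(s[:5]),    # 6. None of the above.
    )
    # An assignment is self-consistent when each claim evaluates to
    # exactly the truth value assigned to that answer.
    return claims == s

# Try all 2**6 = 64 truth assignments and keep the consistent ones.
solutions = [s for s in product((False, True), repeat=6) if consistent(s)]
for s in solutions:
    print([i + 1 for i, ok in enumerate(s) if ok])
```

Under this reading the search leaves a single consistent assignment, the one where only answer 5 is true.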

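phk's ministat

Saw a friend post a small statistics tool on Facebook: "A small tool to do the statistics legwork on benchmarks etc." Turns out it was written by phk and later pulled out to live as a standalone project...

It is a small tool that reads two series of numbers from two files and analyzes them with Student's t-test. The manpage explains: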

Specify desired confidence level for Student's T analysis. Possible values are 80, 90, 95, 98, 99 and 99.5 %

Some people dislike how Student's t-test gets overused, but it is still a sound mathematical method, and it lets you make a quick call when analyzing...
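To make the idea concrete, here is a rough Python sketch of the pooled-variance two-sample Student's t statistic. It is not ministat's source code, and the function name `student_t` and the sample numbers are made up for illustration. It only computes the statistic and the degrees of freedom; comparing against a critical value for a chosen confidence level would still need a t table.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a, b):
    """Return (t, degrees_of_freedom) for two samples, assuming equal variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled sample variance, weighting each sample by its degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df

# Hypothetical benchmark timings from two runs.
before = [1, 2, 3, 4, 5]
after = [2, 3, 4, 5, 6]
t, df = student_t(before, after)
print(t, df)
```

The larger the absolute value of `t` relative to the critical value for `df` degrees of freedom, the less likely the difference between the two runs is just noise.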

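A quick search shows Ubuntu packages it too: "Ubuntu – Package Search Results -- ministat".

The problem of citing your own papers...

Nature points out the problem of self-citation in journal papers (self-citation here includes citations from co-authors): "Hundreds of extreme self-citing scientists revealed in new database".

It opens with an extreme example: Vaidyanathan's self-citation rate is as high as 94%, while the median across academia is 12.7%. It feels like behavior driven by some kind of institutional incentive?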

Vaidyanathan, a computer scientist at the Vel Tech R&D Institute of Technology, a privately run institute, is an extreme example: he has received 94% of his citations from himself or his co-authors up to 2017, according to a study in PLoS Biology this month. He is not alone. The data set, which lists around 100,000 researchers, shows that at least 250 scientists have amassed more than 50% of their citations from themselves or their co-authors, while the median self-citation rate is 12.7%.

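I bring this up because it reminded me of Google's classic PageRank algorithm, which deals with exactly this problem... just swap papers for webpages.

A way to convert miles to km

Saw a fun trick on Twitter:

This works because 1.609 is very close to the golden ratio (about 1.618), and the ratio of consecutive Fibonacci numbers also approaches the golden ratio, so you can use them directly: 5 miles is roughly 8 km, 8 miles roughly 13 km, and so on...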
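A quick Python sketch to check how good the approximation is. The helper name `fib_pairs` is made up here, and 1.609344 is the exact number of kilometres in an international mile.

```python
KM_PER_MILE = 1.609344

def fib_pairs(limit):
    """Yield consecutive Fibonacci pairs (miles, km_guess) with miles <= limit."""
    a, b = 1, 2
    while a <= limit:
        yield a, b
        a, b = b, a + b

for miles, km_guess in fib_pairs(100):
    exact = miles * KM_PER_MILE
    error = abs(km_guess - exact) / exact
    print(f"{miles:3d} mi ~ {km_guess:3d} km (exact {exact:7.2f} km, off by {error:.1%})")
```

The first couple of pairs are noticeably off, but from 5 miles upward the guess stays within about one percent of the exact value.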

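The UK's new banknote will feature Alan Turing

The new £50 note will carry a portrait of Alan Turing: "New face of the Bank of England's £50 note is revealed as Alan Turing".

I don't quite know how to introduce Alan Turing... His contributions to modern computation theory, to the Allies in World War II, and to artificial intelligence are all unmatched. On the other hand, in 1952 he was convicted over his homosexuality, and in 1954 he died by suicide from cyanide poisoning. Then in 2013, while Parliament was still debating a pardon, the Queen decided to exercise the royal prerogative of mercy directly. Now he has been chosen as the face of the £50 note.

Since I make my living in this field, I'll probably keep one as a collectible, in memory of this hero...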

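Elsevier restricts the University of California's access

Back in March, the University of California system (UC) publicly announced it would not renew its contract because Elsevier would not accept its open access terms (see "University of California announces it will not renew with Elsevier"). After that, Elsevier was presumably trying to see whether there was still a chance to keep working together, so it kept providing service to the UC system during this period.

A few days ago I saw "Elsevier cuts off UC’s access to its academic journals (latimes.com)" on Hacker News, so they have finally decided to act: "In act of brinkmanship, a big publisher cuts off UC’s access to its academic journals".

It is not a full cutoff, though. Access is restricted so that new material is not visible (with 2019/01/01 as the dividing line):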

As of Wednesday, Elsevier cut off access by UC faculty, staff and students to articles published since Jan. 1 in 2,500 Elsevier journals, including respected medical publications such as Cell and the Lancet and a host of engineering and scientific journals. Access to most material published in 2018 and earlier remains in force.

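The business model UC proposes has authors bear the cost while readers pay nothing, exactly the opposite of the existing model. UC's model encourages the "dissemination of knowledge", whereas the existing model goes the other way, hoping to make~big~money~and~get~rich~ off the dissemination of knowledge: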

UC demanded that the new contract reflect the principle of open access — that work produced on its campuses be available to all outside readers, for free.

That was a direct challenge to the business model of Elsevier and other big academic publishers. Traditionally, the publishers accept papers for publication for free but charge steep subscription fees. UC is determined to operate under an alternative model, in which researchers pay to have their papers published but not for subscriptions.

Also, in the Hacker News comments I saw that some related projects are under way, such as Europe's "Plan S", which is also pushing open access:

The plan requires scientists and researchers who benefit from state-funded research organisations and institutions to publish their work in open repositories or in journals that are available to all by 2021.

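Also, "PubPub · Community Publishing" is a rather interesting project in the open source space, and it looks like quite a few academic institutions are backing it.

Estimating Germany's tank production rate in WWII

Came across "The German Tank Problem", which covers a famous application of statistics in WWII. The Chinese Wikipedia article on this topic is quite thorough and should be quicker to read: "德國坦克問題":

In the statistical theory of estimation, the problem of estimating the maximum of a discrete uniform distribution from sampling without replacement is known as the German tank problem, named after its use in World War II to estimate the number of German tanks.

As mentioned above, the method is famous because its estimates were extremely accurate: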

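Analysis of tank wheels yielded an estimate of the number of wheel molds in use. After discussion with British wheel makers, they estimated how many wheels that many molds could produce, and in turn the number of tanks that could be produced each month. Analysis of the wheels from two tanks (32 wheels each, 64 wheels in total) yielded an estimate of about 270 tanks produced in February 1944, substantially more than had previously been expected.

German records released after the war showed that production for the month of February 1944 was 276. The precision of the statistical approach was far beyond what conventional intelligence gathering could achieve, and the phrase "German tank problem" became a byword for this type of statistical analysis.

It was later also used to work out the number of classic Commodore 64 units, and that turned out pretty accurate too:

The formula has also been used in non-military contexts, for example to estimate the total number of Commodore 64 computers, where the result (12.5 million) matches the official figures fairly well.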
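The formula itself is not spelled out in the quotes above. The usual frequentist estimator for this problem, given k serial numbers sampled without replacement with largest observed value m, is m + m/k - 1. Here is a minimal Python sketch of it; the function name `estimate_total` and the sample serial numbers are only for illustration, and it assumes serials run from 1 up to the unknown total.

```python
def estimate_total(serials):
    """Estimate the population size from observed serial numbers.

    Uses the standard estimator m + m/k - 1, where m is the largest
    observed serial and k is the number of observations.
    """
    k = len(serials)
    m = max(serials)
    return m + m / k - 1

# Four captured serial numbers (illustrative values).
print(estimate_total([19, 40, 42, 60]))
```

The intuition is that the largest serial seen is always at or below the true total, so the estimate nudges it upward by roughly the average gap between observed serials.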

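GrabFood uses location data to correct restaurant information

Grab's "How we harnessed the wisdom of crowds to improve restaurant location accuracy" is a write-up from their data team on how to use existing data to quickly correct restaurant information. The approach described does not need machine learning; a few simple statistical methods are enough to quickly fix up the existing setup.

This information is actually collected through the driver app, which sends a lot of data back to the server (such as periodically reported GPS positions and pick-up status). Because these drivers know the area, what is in their heads is considerably more accurate than the map: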

One of the biggest advantages we have is the huge driver-partner fleet we have on the ground in cities across Southeast Asia. They know the roads and cities like the back of their hand, and they are resourceful. As a result, they are often able to find the restaurants and complete orders even if the location was registered incorrectly.

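So with this information they can turn around and improve the map data. For example, from where a driver taps the "pick up" button and how long they stay there, you can estimate the restaurant's likely location and compare it against the map data, which makes it easy to spot restaurants that have moved without the map being updated: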

Fraction of the orders where the pick-up location was not “at” the restaurant: This fraction indicates the number of orders with a pick-up location not near the registered restaurant location (with near being defined both spatially and temporally as above). A higher value indicates a higher likelihood of the restaurant not being in the registered location subject to order volume

Median distance between registered and estimated locations: This factor is used to rank restaurants by a notion of “importance”. A restaurant which is just outside the fixed radius from above can be addressed after another restaurant which is a kilometer away.
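A rough Python sketch of these two signals, under simplifying assumptions that are mine and not Grab's: "near" is reduced to a purely spatial check against a hypothetical 100 m radius (the temporal part is ignored), distances use the haversine formula, and the median is taken over per-order distances from the registered location instead of over a separately estimated location. All names and coordinates are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def restaurant_signals(registered, pickups, radius_m=100.0):
    """Return (fraction of pick-ups outside radius_m, median distance in metres)."""
    dists = [haversine_m(*registered, lat, lon) for lat, lon in pickups]
    far_fraction = sum(d > radius_m for d in dists) / len(dists)
    return far_fraction, median(dists)

# Hypothetical restaurant: registered at one spot, most pick-ups about 1 km north.
registered = (1.3000, 103.8000)
pickups = [(1.3000, 103.8000), (1.3090, 103.8000),
           (1.3090, 103.8000), (1.3090, 103.8000)]
print(restaurant_signals(registered, pickups))
```

A restaurant with a high far fraction and a large median distance would go near the top of the list for someone to check.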

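There are quite a few other improvements as well (for example, a driver must be within a certain distance of the restaurant before tapping "pick up", and that "distance" needs adjusting because the restaurant may be inside an indoor mall). The overall result shows up as a big drop in the order cancellation rate:

Overall it looks like the system generates a list that people then follow up on (calling the restaurant to ask, perhaps?), but the list produced this way should be highly accurate (drivers are not going to waste their own time by going somewhere odd just to tap pick-up), so running simple algorithms over this data can quickly fix a lot of problems...

University of California announces it will not renew with Elsevier

The University of California (a university system with ten campuses, more than 250,000 students, and 140,000 faculty and staff) considers that Elsevier has not met the standard open access should meet, has decided not to renew its contract with Elsevier, and issued a press release criticizing Elsevier: "UC terminates subscriptions with world’s largest scientific publisher in push for open access to publicly funded research".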

As a leader in the global movement toward open access to publicly funded research, the University of California is taking a firm stand by deciding not to renew its subscriptions with Elsevier. Despite months of contract negotiations, Elsevier was unwilling to meet UC’s key goal: securing universal open access to UC research while containing the rapidly escalating costs associated with for-profit journals.

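This is probably the first shot fired among top US institutions? We'll see how many others it leads to drop their subscriptions...

European research funders push for open access to research papers

In "Radical open-access plan could spell end to journal subscriptions" I read that 11 European research funders have formed "cOAlition S" to push for open access to research papers.

The goal is that starting in 2020, research funded by these organizations must be published on platforms that meet full open access requirements: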

cOAlition S signals the commitment to implement, by 1 January 2020, the necessary measures to fulfil its main principle: “By 2020 scientific publications that result from research funded by public grants provided by participating national and European research councils and funding bodies, must be published in compliant Open Access Journals or on compliant Open Access Platforms.”

Right now only about 15% of journals do:

According to a December 2017 analysis, only around 15% of journals publish work immediately as open access (see ‘Publishing models’) — financed by charging per-article fees to authors or their funders, negotiating general open-publishing contracts with funders, or through other means.

This is a way to reduce the influence of the platforms that charge for downloads...