search – Page 11 – Gea-Suan Lin's BLOG

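KKBOX Is Hiring: Platform Operations Division (API Team)

Index:

KKBOX Is Hiring (Overview)
KKBOX Is Hiring: Platform Operations Division (API Team)
KKBOX Is Hiring: Software Development Center (i.e. Client Team)

Following up on the previous post, "KKBOX Is Hiring": I have also asked a colleague on the Client Team to write one, and once it is done I will post it so everyone knows who the Client Team is looking for.

For the Server Team, I will go through the openings department by department, starting this time with the Platform Operations Division (API Team).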

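Catalog Development Department

The Catalog Development Department integrates the APIs and DDEX data provided by record labels into KKBOX's systems, either semi-automatically or fully automatically.

In some cases this department also has to write programs to handle catalog data in special ways. For example, Gold Typhoon was recently acquired by Warner Music Group, which meant the licensing entities had to be migrated.

This department also develops the system for manual catalog ingestion, which is operated by a separate department in the company.

API Development Department

The API Development Department develops and maintains the APIs used by KKBOX's applications.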

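Platform Development Department

The Platform Development Department is responsible for building out our systems. Here is an attempt at a list (which is almost certainly incomplete):

Designing and operating the search engine. We currently use Solr and are looking into migrating to Elasticsearch.
Working with the Catalog Development Department on things like audio transcoding and the DRM mechanism.
Working with the API Development Department on things like choosing, based on business logic, whether to serve audio files over the international bandwidth we lease ourselves or through Akamai.
All kinds of "psychic debugging" work.

Audio/Video Services Development Department

R&D related to audio and video, again mostly on the server side.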

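Who are we looking for?

The list below is not exhaustive; candidates can combine several of these skills or bring others:

System analysis and system design (SA & SD), covering analysis and design for the work described above. Managers will assign the corresponding projects.
Java engineers (and senior engineers), currently mainly for the Platform Development Department's search engine.
PHP engineers (and senior engineers). All four departments mentioned here are hiring.
Full Stack Engineers. Both the Platform Development Department and the Audio/Video Services Development Department are hiring.

To be continued...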

On a Togashi-style hiatus...

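.SUCKS Domains...

easyDNS has put the issue on the table with "Why We Will Not Be Registering easyDNS.SUCKS".

The whole point of the .SUCKS domain is to make a windfall, which is obvious from the Sunrise Claim price: USD $2,499. Objections raised when the TLD was being established were not accepted, and it does not look like anything further will happen now.

Unless, as one comment suggests, Google boycotts it, but that really does not seem likely...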

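The EU Begins Investigating Google's Monopoly

The EU has begun investigating Google's monopoly: "Europe to accuse Google of illegally abusing its dominance"; the original report is "Europe to accuse Google of illegally abusing its dominance".

I wonder whether this is related to the document mentioned earlier in "Google Uses Its Search Monopoly to Attack Competitors":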

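FBI's "Creative" Search Rejected by a US Court

I came across "FBI can’t cut Internet and pose as cable guy to search property, judge says" on Zite. It covers how a court rejected the FBI's "creative" approach.

Here is how the search was carried out. FBI agents first cut the Internet connection, then posed as repair technicians to get inside and search: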

The Las Vegas court frowned on the FBI's ruse of disconnecting Internet access to $25,000-per-night villas at Caesar's Palace Hotel and Casino. FBI agents posed as the cable guy and secretly searched the premises.

They then claimed the search was legal because they had been invited in:

The government claimed the search was legal because the suspects invited the agents into the room to fix the Internet.

The judge clearly did not buy it. In the judge's words:

Permitting the government to create the need for the occupant to invite a third party into his or her home would effectively allow the government to conduct warrantless searches of the vast majority of residents and hotel rooms in America,

In other words, unless the person clearly knows about the search and consents to it, this kind of manufactured "consent" is not lawful.

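Thoughts After a Month of Using DuckDuckGo...

It has been about a month since I switched my search engine to DuckDuckGo. A first step away from Google, perhaps? (I even switched the one on my iPhone.)

Right after switching I had the feeling that I "couldn't really find things". I would then try the same keywords in Google and discover I couldn't find much there either... After a few rounds of this, I actually found that DuckDuckGo has less spam, which makes searching smoother.

The less convenient part is currency conversion. With Google I often typed 1jpy, 1usd, or 1eur to check exchange rates, and now I can't. It is also a little slower, but still acceptable.

Next I will see whether I can replace Gmail, although I think that will be rather hard... I will keep using it while I look :o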

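Google Uses Its Search Monopoly to Attack Competitors

A hot news item lately: "FTC: Google Altered Search Results For Profit". Media here in Taiwan have covered it too ("報導：FTC機密文件指控Google不當商業行為壟斷搜尋市場"), but it does not seem to have drawn much attention?

It started when The Wall Street Journal (WSJ), requesting materials under FOIA (the Freedom of Information Act), unexpectedly obtained a 2012 FTC document assessing Google's monopoly. The case itself was settled in early 2013.

For WSJ's coverage, see "Inside the U.S. Antitrust Probe of Google". (It is behind a paywall; you can search for the title on Google and click through from there XDDD)

Also, Google knew internally that its market share was considerably higher than outside estimates (65% externally, versus 69% to 84% by its own assessment), and was glad about the gap (it eased antitrust pressure):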

Data included in the report suggest Google was more dominant in the U.S. Internet search market than was widely believed. The company estimated its market share at between 69% and 84% during a period when research firm comScore put it at 65%. “From an antitrust perspective, I’m happy to see [comScore] underestimate our share,” the report quoted Google Chief Economist Hal Varian as saying, without specifying the context.

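Let's see what the US government does next. The EU will probably cross-reference this too?

InnoDB Full-Text Search in MySQL 5.7

I saw an introduction to the InnoDB full-text search feature in MySQL 5.7 in "InnoDB Full-Text : N-gram Parser". It opens with an important note: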

I’m now very happy to say that in MySQL 5.7.6 we’ve made use of the new pluggable full-text parser support in order to provide you with an n-gram parser that can be used with CJK!

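This is very convenient for companies with medium or smaller data volumes: you can set up a replication server dedicated to search, instead of relying on a reliable queue to make sure updates get pushed to Solr or Elastic (the company renamed itself; it was previously called Elasticsearch).

That said, with a very large data volume you would probably still need a Solr or Elastic based setup...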
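To make the n-gram idea concrete, here is a minimal Python sketch of why fixed-size character n-grams work for CJK text, which has no spaces between words. This is not MySQL's implementation: `ngram_tokens`, `build_index`, and `search` are hypothetical helper names, the documents are made up, and `n=2` is simply a bigram choice for illustration.

```python
from collections import defaultdict


def ngram_tokens(text, n=2):
    """Split text into overlapping n-character tokens, run by run (whitespace separates runs)."""
    tokens = []
    for run in text.split():
        tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


def build_index(docs, n=2):
    """Build a toy inverted index: token -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in ngram_tokens(text, n):
            index[token].add(doc_id)
    return index


def search(index, query, n=2):
    """Return ids of documents that contain every n-gram of the query."""
    tokens = ngram_tokens(query, n)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {1: "全文搜尋引擎", 2: "搜尋引擎的設計", 3: "音檔轉檔"}
index = build_index(docs)
print(ngram_tokens("全文搜尋"))          # ['全文', '文搜', '搜尋']
print(sorted(search(index, "搜尋引擎")))  # [1, 2]
```

Because every pair of adjacent characters becomes a token, no dictionary or word segmenter is needed. The cost is a larger index and some false matches, since a document only has to contain all of the query's n-grams, not necessarily next to each other.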

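Google Publishes an Algorithm for Estimating Web Page Trustworthiness (Knowledge-Based Trust)

I saw on Slashdot that Google has published an algorithm for estimating the trustworthiness of web pages, Knowledge-Based Trust (KBT): "Google Wants To Rank Websites Based On Facts Not Links". The original paper is available as a PDF at "Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources".

The principle behind the paper is not hard to follow (the method is actually quite interesting), and it makes three main contributions.

The first is the ability to tell whether the problem lies in how the information was extracted (the extraction algorithm is not good enough) or whether the site itself gives incorrect information: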

Our main contribution is a more sophisticated probabilistic model, which can distinguish between two main sources of error: incorrect facts on a page, and incorrect extractions made by an extraction system.

The second is a performance improvement, achieved by adaptively choosing how finely to group sources:

Our second contribution is a new method to adaptively decide the granularity of sources to work with: if a specific webpage yields too few triples, we may aggregate it with other webpages from the same website. Conversely, if a website has too many triples, we may split it into smaller ones, to avoid computational bottlenecks (Section 4).
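The merge-and-split idea in that quote can be sketched in a few lines of Python. This is only a toy illustration, not the paper's actual procedure: `regroup_sources`, the thresholds `min_triples` and `max_triples`, the `#part` naming, and the sample URLs are all assumptions I made up for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def regroup_sources(page_triples, min_triples=3, max_triples=4):
    """Return {source_name: [triples]} after merging sparse pages and splitting dense sources."""
    merged = defaultdict(list)
    for url, triples in page_triples.items():
        if len(triples) < min_triples:
            # Too few triples on this page: use the whole website (hostname) as the source.
            key = urlsplit(url).netloc
        else:
            key = url
        merged[key].extend(triples)

    result = {}
    for key, triples in merged.items():
        if len(triples) <= max_triples:
            result[key] = triples
            continue
        # Too many triples in one source: split into chunks of at most max_triples.
        for start in range(0, len(triples), max_triples):
            part = start // max_triples
            result[f"{key}#part{part}"] = triples[start:start + max_triples]
    return result


# Hypothetical (subject, predicate, object) triples, for illustration only.
pages = {
    "http://a.example/p1": [("s1", "p", "o1")],
    "http://a.example/p2": [("s2", "p", "o2"), ("s3", "p", "o3")],
    "http://b.example/big": [(f"s{i}", "p", f"o{i}") for i in range(6)],
}
for name, triples in regroup_sources(pages).items():
    print(name, len(triples))
```

Here the two sparse pages on `a.example` are pooled into one site-level source, while the dense page on `b.example` is cut into two smaller sources, so no single source is too thin to estimate from or too large to process in one go.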

The third is a detailed, large-scale evaluation of how the model performs:

The third contribution of this paper is a detailed, large-scale evaluation of the performance of our model.

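KBT is not meant to replace PageRank but to work alongside it, which should be effective against sites like content farms. After all, PageRank's assumptions are reasonable in the general case.

The section "High PageRank but low KBT (top-left corner)" talks about exactly this: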

We consider the 15 gossip websites listed in [16]. Among them, 14 have a PageRank among top 15% of the websites, since such websites are often popular. However, for all of them the KBT are in the bottom 50%; in other words, they are considered less trustworthy than half of the websites. Another kind of websites that often get low KBT are forum websites.

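I will find time to read up on other similar algorithms...

Keyword Censorship in China

Slashdot's "New Compilation of Banned Chinese Search-Terms Reveals Curiosities" cites the article "Some curious search terms denied to the Chinese". There is a repository on GitHub that tries to collect these keywords: "jasonqng/chinese-keywords".

The first thing I noticed in the report, though, was the image it used:

Or is Taiwan in fact already heavily censored? hmmm...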

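Wikimedia Foundation's August 2014 Monthly Report

The Wikimedia Foundation has released its August monthly report (about three months late, it seems?): "Wikimedia Foundation Report, August 2014". A more condensed version is at "Wikimedia Highlights, August 2014".

The report includes some page view figures, both comScore's numbers and statistics computed from the Foundation's own server logs. It also covers the financial situation.

The technical parts are taken from the "Wikimedia Engineering/Report/2014/August" page. Since this is August data, I also took a peek at September and October: "Wikimedia Engineering/Report/2014/September" and "Wikimedia Engineering/Report/2014/October".

You can see the plan to test HHVM, and so far it looks quite good: "[Wikitech-l] [Engineering] Migrating test.wikipedia.org to HHVM". They tested with test.wikipedia.org, and the speed test shows a big improvement: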

1) Speed test: measure the time taken to request the page 1000 times over just 10 concurrent connections:

                        HHVM    Zend    diff
Mean time (ms):         233     441     -47%
99th percentile (ms):   370     869     -57%
Request/s:              43      22.6    +90%
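As a quick sanity check on how the diff column can be read, here is a small Python sketch that recomputes it as the relative change of HHVM against Zend, using the speed test numbers above. The helper name `pct_diff` and the rounding to whole percent are my assumptions; the original post does not say how the column was computed.

```python
def pct_diff(new, old):
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


# (HHVM, Zend) pairs copied from the speed test table.
speed_test = {
    "Mean time (ms)": (233, 441),
    "99th percentile (ms)": (370, 869),
    "Request/s": (43, 22.6),
}

for metric, (hhvm, zend) in speed_test.items():
    print(f"{metric:<22} {pct_diff(hhvm, zend):+.0f}%")
# Mean time (ms)         -47%
# 99th percentile (ms)   -57%
# Request/s              +90%
```

For the time rows a negative number is the good direction (less time per request), while for Request/s a positive number is.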

The load test results are even better:

2) Load test: measure how much thoughput we obtain when hogging the appserver with 50 concurrent requests for a grand total of 10000 requests. What I wanted to test in this case was the performance degradation and the systems resource consumption

                        HHVM    Zend    diff
Mean time (ms):         355     906     -61%
99th percentile (ms):   791     1453    -45%
Request/s:              141     55.1    +156%
Network (Mbytes/s)      17      7       +142%
RAM used (GBs):         5(1)    11(4)
CPU usage (%):          90(75)  100(90)

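The main reason Wikipedia has not run into many problems is that its software is open source and large enough that it became part of HHVM's own testing: "Compatibility Update".

For now, though, it looks like they are still running PHP; I have not seen a plan to move everything over.

The search engine replacement, on the other hand, has not gone as smoothly. Switching to Elasticsearch improved things a lot, but the August report says: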

Started deploying Cirrus as the primary search back-end to more of the remaining wikis and we found what looks like our biggest open performance bottleneck. Next month's goal is to fix it and deploy to more wikis (probably not all). We're also working on getting more hardware.

By September they were saying there was no silver bullet and that they would have to throw hardware at it:

In September we worked to mitigate the performance bottleneck that we found in August. We found there to be no silver bullet but used the information we learned to pick and order appropriate hardware to handle the remaining wikis. We also implemented out significantly improved wikitext Regular Expression search. In October we've begun rolling out the wikitext Regular Expression search and received some of the hardware we need to finish cutting over the remaining wikis. We believe we'll get it all installed in October and cut the remaining wikis over in November.

In October they got the machines:

In October we prepared for November in which we deployed Cirrus to all the remaining wikis by installing new servers installing new versions of Elasticsearch and our plugins. We also fixed up regex search which had caused a search outage.

The pages linked from these reports actually contain some remarks that would never make it into an external press release... XD