找數列的平均值

2016 年的文章,不過算是經典的題目,所以最近又冒出來了。要怎麼找數列的平均值:「Calculating the mean of a list of numbers」。

You have a list of floating point numbers. No nasty tricks - these aren’t NaN or Infinity, just normal “simple” floating point numbers.

Now: Calculate the mean (average). Can you do it?

你有一串浮點數 (沒有 NaN 與 Infinity),要怎麼找出平均值。要考慮的包括:

  • 第一個要處理的就是設計演算法時各種會 overflow 的情況。
  • 降低誤差。
  • 合理的計算量。

好像很適合拿來 data team 面試時互相討論的題目?因為「平均值」是個商業上本來就有意義的指標,而且從 time-series events 灌進來的資料量有機會產生各種 overflow 情境,或是精確度問題,所以這個問題其實是個在真實世界上會遇到的情境。

想了一下,如果是 integer 的確是簡單很多 (可以算出正確的值),但如果是 float 類型真的難很多:

It also demonstrates a problem: Floating point mathematics is very hard, and this makes it somewhat unsuitable for testing with Hypothesis.

馬上想到的地雷是在 IEEE 754 的 float 世界裡,2^24 + 1 還是 2^24

#include <math.h>
#include <stdio.h>

int main(void)
{
    int i;
    float a;

    for (i = 0; i < 32; i++) {
        a = pow(2, i);
        printf("2^%d     = %f\n", i, a);

        a += 1;
        printf("2^%d + 1 = %f\n", i, a);
    }
}

然後在這邊可以看出差異:

2^23     = 8388608.000000
2^23 + 1 = 8388609.000000
2^24     = 16777216.000000
2^24 + 1 = 16777216.000000

0.1 + 0.2 = 0.30000000000000004

看到「http://0.30000000000000004.com/」這個網站對經典的 0.1 + 0.2 問題整理了各語言的結果。這個網址名稱也很機車啊 XD

開頭的說明講述 IEEE 754 二進制表示法的問題:

Your language isn't broken, it's doing floating point math. Computers can only natively store integers, so they need some way of representing decimal numbers. This representation comes with some degree of inaccuracy. That's why, more often than not, .1 + .2 != .3.

It's actually pretty simple. When you have a base 10 system (like ours), it can only express fractions that use a prime factor of the base. The prime factors of 10 are 2 and 5. So 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly because the denominators all use prime factors of 10. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals because their denominators use a prime factor of 3 or 7. In binary (or base 2), the only prime factor is 2. So you can only express fractions cleanly which only contain 2 as a prime factor. In binary, 1/2, 1/4, 1/8 would all be expressed cleanly as decimals. While, 1/5 or 1/10 would be repeating decimals. So 0.1 and 0.2 (1/10 and 1/5) while clean decimals in a base 10 system, are repeating decimals in the base 2 system the computer is operating in. When you do math on these repeating decimals, you end up with leftovers which carry over when you convert the computer's base 2 (binary) number into a more human readable base 10 number.

這邊主要是討論 IEEE 754-1985 這個標準,後來在 IEEE 754-2008 提出了新的表示方法,支援十進位的表示法來解這個問題 (雖然還沒普及)。