4

phi相关系数的算法

 2 years ago
source link: http://fantaghiro.github.io/study/2016/04/30/phi-coefficient.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client
phi相关系数的算法

2016-04-30

在阅读Eloquent Javascript一书时,在第四章中碰到了phi coefficient的计算问题,决定仔细记录一下。

内容与该页相同。

维基百科中,phi相关系数是这样定义的:

在统计学里,phi相关系数是测量两个二元变数(binary variables or dichotomous variables)之间相关性的工具,由卡尔·皮尔森发明。

算法如下:

将两个变数排成2×2列联表,注意1和0的位置必须如下表。若只变动x或只变动y的0/1位置,计算出来的phi相关系数会正负号相反。若同时变动x和y的0/1位置,则计算出来的phi相关系数相同。phi相关系数的基本概念是:两个二元变数的观察值若大多落在2×2列联表的“主对角线”(英语:diagonal:左上-右下线)栏位,亦即若观察值大多为(x,y)=(1,1),(0,0)这两种组合,那么这两个变数呈正相关。反之,若两个二元变数的观察值大多落在“非对角线”(英语:off-diagonal:主对角线以外的位置)栏位,对应于2×2列联表,亦即若观察值大多为(x,y)=(0,1),(1,0)这两种组合,则这两个变数呈负相关。例如,我们从两个随机二元变数(x,y)抽样得出这样的2×2列联表:

y=1 y=0 总计
x=1 n11 n10 n1·
x=0 n01 n00 n0·
总计 n·1 n·0 n

其中n11, n10, n01, n00都是非负数的栏位记次值,它们加总为n,亦即观察值的个数。由上面的表格可以得出x和y的phi相关系数如下:

\phi = (n11n00 - n10n01) / Math.sqrt(n1·n0·n·0n·1)

假如说,我们有下面100条数据,是关于性别与使用左手还是右手的记录,我们应该怎样得到上面这样一个表格呢?

var log = {
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "left"},    
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "left"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "female", "hand": "right"},
    {"sex": "male", "hand": "right"},
}

通过手工计数,我们得到下表:

男=0 女=1 总计
右=0 43 44 87
左=1 9 4 13
总计 52 48 100

根据上表,00位置代表“男性、右手”,01代表“男性、左手”,10代表“女性、右手”、11代表“女性、左手”。

将位置00、01、10、11的数值与一维数组table中的值对应,可以有两种看待的方式:

第一种:如果将00、01、10、11视为二进制数的话,那么他们代表的就分别是0、1、2、3(即右侧第一位的1计数为1、右侧第二位上的1计数为2,两者相加得到对应于一维数组中的index),那么分别于一维数组对应,就对应着table[0]、table[1]、table[2]、table[3]。

那么phi的计算方式应当如下:

function phi(table){
    return (table[3]*table[0] - table[2]*table[1])/Math.sqrt((table[2] + table[3]) *
                (table[0] + table[1]) *
                (table[1] + table[3]) *
                (table[0] + table[2]));
}

另外,要用程序的方式得到原本我们手工计数的那张表格,我们要对数据进行遍历:

var table = [0, 0, 0, 0];

log.forEach(function(item){
    var index = 0;
    if(item["sex"] == "female"){
        index += 1; //如果为女,相当于列联表的右侧第一位为1
    }
    if(item["hand"] == "left"){
        index += 2 //如果为左手,相当于列联表的左侧第一位为1
    }
    table[index] += 1; //每次循环,table相应的位置的值计数加1
});

console.log(table); //[43, 44, 9, 4]
//然后将得到的table代入上面的phi函数中即可得到结果
console.log(phi(table)); //-0.133319729973268

第二种:如果依然将00、01、10、11视为网格的坐标的话,那么也很简单。将其转化为一维数组,只需要通过 x + y * width 这个公式就可以得到相应位置的值(表格width为2):00位置对应 table[0 * 2 + 0]、01位置对应 table[0*2 + 1]、10位置对应[1 * 2 + 0]、11位置对应的是table[1*2 + 1],那么这时候00、01、10、11分别对应的也是table[0]、table[1]、table[2]、table[3],与上面的一种方式得到的是同一个数组。注意,此时的00、01、10、11一定要搞清楚哪个是x、哪个是y。这时候其实前面一个数值对应的是y的值;而后面一个对应的是x的值。后续算法与上一中算法也是相同的。


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK