Distance Measures

To calculate similarities or dissimilarities (distances) between two binary samples, four counts are determined from the total number of positions (n = a+b+c+d): the number of total matches (a), the numbers of single matches (b, c) and the number of no matches (d).

> Tab. 1 shows how these variables are defined.
> Examples of non-normalized and normalized distance measures are given in Tab. 2.
> Formulas for several distance or similarity measures are given in Tab. 3.

Tab. 1. Variables used to calculate binary similarities/dissimilarities.
| Character in sample #1 | Character in sample #2 = 1 | Character in sample #2 = 0 |
|---|---|---|
| 1 | a = 1/1 (in both samples) | b = 1/0 (only in sample #1) |
| 0 | c = 0/1 (only in sample #2) | d = 0/0 (in none of the samples) |
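As an illustration of how the four counts in Tab. 1 can be obtained, here is a minimal Python sketch. The function name `contingency_counts` and the two example samples are assumptions made for this example only; they do not come from the software described here.

```python
def contingency_counts(sample1, sample2):
    """Return (a, b, c, d) for two equal-length binary (0/1) sequences.

    a: character present in both samples (1/1)
    b: character present only in sample #1 (1/0)
    c: character present only in sample #2 (0/1)
    d: character absent from both samples (0/0)
    """
    if len(sample1) != len(sample2):
        raise ValueError("both samples must have the same number of positions")
    a = b = c = d = 0
    for x, y in zip(sample1, sample2):
        if x and y:
            a += 1
        elif x:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical example data: six positions per sample.
sample1 = [1, 1, 0, 0, 1, 0]
sample2 = [1, 0, 1, 0, 1, 0]
a, b, c, d = contingency_counts(sample1, sample2)
print(a, b, c, d)      # 2 1 1 2
print(a + b + c + d)   # 6, the total number of positions n
```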

Tab. 2. Examples for distances.
| Name | Formula | Comment |
|---|---|---|
| Hamming distance (Manhattan, city-block, taxi-cab) | `b+c` | non-normalized distance, increases with the number of characteristics |
| Euclidean distance | `sqrt(b+c)` | non-normalized distance, increases with the number of characteristics |
| Soergel distance | `(b+c)/(a+b+c)` | normalized distance, complementary to Tanimoto: `1-a/(a+b+c)` |
| Mean Hamming distance | `(b+c)/n = (b+c)/(a+b+c+d)` | normalized distance |
| Mean Euclidean distance | `sqrt((b+c)/n) = sqrt((b+c)/(a+b+c+d))` | normalized distance |
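The difference between non-normalized and normalized distances can be made concrete with a small Python sketch. It evaluates the Hamming and Euclidean rows of Tab. 2 and their mean variants from given counts. The function name `binary_distances` and the example counts are assumptions chosen for illustration.

```python
import math


def binary_distances(a, b, c, d):
    """Evaluate the Hamming/Euclidean distances of Tab. 2 from the four counts."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("at least one position is required")
    mismatches = b + c
    return {
        "hamming": mismatches,                        # non-normalized
        "euclidean": math.sqrt(mismatches),           # non-normalized
        "mean_hamming": mismatches / n,               # normalized by n
        "mean_euclidean": math.sqrt(mismatches / n),  # normalized by n
    }


# Hypothetical counts; the second call doubles every count,
# as if the same pattern were observed over twice as many characters.
short = binary_distances(2, 1, 1, 2)
long = binary_distances(4, 2, 2, 4)

print(short["hamming"], long["hamming"])  # 2 4
print(round(short["mean_hamming"], 3), round(long["mean_hamming"], 3))  # 0.333 0.333
```

The non-normalized Hamming distance doubles with the number of characters, while the mean Hamming distance stays the same.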

Tab. 3. Normalized similarity and correlation measures.
| Name | Similarity Formula |
|---|---|
| Jaccard | `a/(n-d)` [= `a/(a+b+c)`] |
| Russell & Rao | `a/n` |
| Rogers & Tanimoto | `(a+d)/(a+2*(b+c)+d)` |
| Kulczynski #1 | `a/(b+c)` |
| Kulczynski #2 | `0.5*(a/(a+b)+a/(a+c))` |
| Dice | `2*a/(2*a+b+c)` |
| Pearson's Phi coefficient | `((a*d)-(c*b))/sqrt((a+c)*(c+d)*(a+b)*(b+d))` |
| Baroni-Urbani/Buser | `(a+sqrt(a*d))/(a+b+c+sqrt(a*d))` |
| Braun-Blanquet | `if (a+b)>(a+c) then S:=a/(a+b) else S:=a/(a+c)` [= `a/max(a+b, a+c)`] |
| Simpson similarity coefficient | `if (a+b)<(a+c) then S:=a/(a+b) else S:=a/(a+c)` [= `a/min(a+b, a+c)`] |
| Michael | `4*(a*d-b*c)/((a+d)*(a+d)+(b+c)*(b+c))` |
| Sokal and Sneath #1 | `a/(a+2*(b+c))` |
| Sokal and Sneath #2 | `0.25*(a/(a+b) + a/(a+c) + d/(b+d) + d/(c+d))` |
| Sokal and Sneath #3 | `a*d/sqrt((a+b)*(a+c)*(d+b)*(d+c))` |
| Sokal and Sneath #4 | `(a+d)/(b+c)` |
| Sokal and Sneath #5 | `2*(a+d)/(2*(a+d)+b+c)` |
| Simple Matching | `(a+d)/(a+b+c+d)` = `(a+d)/n` |
| Mean Hamming | `1 - D` |
| Sneath & Sokal | `(a+d)/(a+0.5*(b+c)+d)` |
| Kocher & Wong | `a*n/((a+b)*(c+d))` |
| Faith | `(a+d/2)/n` |
| Ochiaï #1 | `a/sqrt((a+b)*(a+c))` |
| Ochiaï #2 | `a*d/sqrt((a+b)*(a+c)*(d+b)*(d+c))` |
| Q0 | `(b*c)/(a*d)` |
| Yule's Sigma | `(sqrt(a*d)-sqrt(b*c))/(sqrt(a*d)+sqrt(b*c))` |
| Yule's Q | `(a*d-b*c)/(a*d+b*c)` |
| Upholt | `F = 2*a/(2*a+b+c)`; `S = Power(0.5*(-F+sqrt(F*F+8*F)), 1/n)` |
| Excoffier | `n*(1-(a/n))` |
| Hamann | `(a-(b+c)+d)/n` |
| Roux #1 | `(a+d)/(min(b,c)+min(n-b,n-c))` |
| Roux #2 | `(n-a*d)/sqrt((a+b)*(c+d)*(a+c)*(b+d))` |
| Michelet | `a*a/(b*c)` |
| Fager & McGowan | `a/sqrt((a+b)*(a+c)) + 1/sqrt(a+b)` |
| Fager | `a/sqrt((a+b)*(a+c)) - max(b,c)` |
| Unigram subtuples | `Log(a*d/b/c)-3.29*sqrt(1/a+1/b+1/c+1/d)` |
| U cost | `Log(1+(min(b,c)+a)/(max(b,c)+a))` |
| S cost | `Log(1+min(b,c)/(a+1))**-.5` |
| R cost | `Log(1+a/(a+b))*log(1+a/(a+c))` |
| T combined cost | `Sqrt(U * S * R)` with `U = Log(1+(min(b,c)+a)/(max(b,c)+a))`, `S = Log(1+min(b,c)/(a+1))**-.5`, `R = Log(1+a/(a+b))*log(1+a/(a+c))` |
| McConnaughey | `(a*a - b*c)/sqrt((a+b)*(a+c))` |
| Phi Square | `power((a*d + b*c), 2)/((a+b)*(a+c)*(b+c)*(b+d))` |
| Forbes | `n*a/((a+b)*(a+c))` |
| Fossum | `n*(a-0.5)*(a-0.5)/((a+b)*(a+c))` |
| Stiles | `log10(n*power((abs(a*d-b*c)-n/2), 2)/((a+b)*(a+c)*(b+d)*(c+d)))` |
| Dispersion | `(a*d-b*c)/power(a+b+c+d, 2)` |
| Dennis | `(a*d-b*c)/sqrt(n*(a+b)*(a+c))` |
| Pearson Chi Square | `n*power(abs(a*d-b*c)-n/2, 2)/((a+b)*(c+d)*(a+c)*(b+d))` |
| Mountford | `2*a/(2*b*c+a*b+a*c)` |
| Mutual Information | `ln(a*n/((a+b)*(a+c)))` |
| Weighted Mutual Information #3 | `ln(power(a, 3)*n/((a+b)*(a+c)))` |
| Chi Square with Yates correction (n/2) | `(n*(abs(a*d-b*c)-n/2)**2)/((a+b)*(c+d)*(a+c)*(b+d))` |
| Normalized Collocation | `a/(b+c-a)` |
| Dunning | `2*(a*log(a) + b*log(b) + c*log(c) + d*log(d)) - (a+b)*log(a+b) - (a+c)*log(a+c) - (b+d)*log(b+d) - (c+d)*log(c+d) + (a+b+c+d)*log(a+b+c+d)` |
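To show how the formulas in Tab. 3 translate into code, the following Python sketch evaluates a handful of them from the four counts. The names `safe_div` and `binary_similarities` are hypothetical. Returning NaN when a denominator is zero is an assumption made for this example, not documented behaviour of the software.

```python
import math


def safe_div(numerator, denominator):
    """Divide, returning NaN for a zero denominator (assumed convention)."""
    return numerator / denominator if denominator != 0 else math.nan


def binary_similarities(a, b, c, d):
    """Evaluate a few of the similarity measures listed in Tab. 3."""
    n = a + b + c + d
    return {
        "jaccard": safe_div(a, a + b + c),
        "russell_rao": safe_div(a, n),
        "dice": safe_div(2 * a, 2 * a + b + c),
        "simple_matching": safe_div(a + d, n),
        "ochiai_1": safe_div(a, math.sqrt((a + b) * (a + c))),
        "yule_q": safe_div(a * d - b * c, a * d + b * c),
        "hamann": safe_div(a - (b + c) + d, n),
    }


# Hypothetical counts: a=2, b=1, c=1, d=2 (n=6).
for name, value in binary_similarities(2, 1, 1, 2).items():
    print(f"{name}: {value:.3f}")
```

Measures that ignore d (Jaccard, Dice, Ochiaï #1) are unaffected by characters absent from both samples, whereas Simple Matching and Hamann count shared absences as agreement.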