Distance Measures

To calculate similarities or dissimilarities (distances) between two samples, the number of total matches (a), single matches (b, c) and no matches (d) is counted over the total number of positions (n = a+b+c+d).

> Tab. 1 shows how these variables are defined.
> Examples of non-normalized and normalized distance measures are given in Tab. 2.
> Formulas for several distance or similarity measures are given in Tab. 3.


Tab. 1. Variables used to calculate binary similarities/dissimilarities.
Variable   Sample #1   Sample #2   Meaning
a          1           1           character present in both samples
b          1           0           character present only in sample #1
c          0           1           character present only in sample #2
d          0           0           character present in none of the samples
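
The counting described in Tab. 1 can be sketched in Python as follows. This is only a minimal illustration, not the program's own code: the function name `count_matches` and the two example vectors are assumptions made for this sketch, and each sample is assumed to be a list of 0/1 character states of equal length.

```python
def count_matches(sample1, sample2):
    """Count a, b, c, d (Tab. 1) for two equal-length 0/1 vectors.

    Hypothetical helper written for illustration only.
    """
    if len(sample1) != len(sample2):
        raise ValueError("samples must have the same number of positions")
    a = b = c = d = 0
    for x, y in zip(sample1, sample2):
        if x and y:
            a += 1      # 1/1: present in both samples
        elif x:
            b += 1      # 1/0: present only in sample #1
        elif y:
            c += 1      # 0/1: present only in sample #2
        else:
            d += 1      # 0/0: present in none of the samples
    return a, b, c, d


# Made-up example data with 8 positions
s1 = [1, 1, 0, 1, 0, 0, 1, 0]
s2 = [1, 0, 0, 1, 1, 0, 0, 0]
a, b, c, d = count_matches(s1, s2)
n = a + b + c + d
print(a, b, c, d, n)  # 2 2 1 3 8
```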



Tab. 2. Examples of distances.
Name                                                Formula                                Comment
Hamming distance (Manhattan, city-block, taxi-cab)  b+c                                    non-normalized distance, increases with the number of characteristics
Euclidean distance                                  sqrt(b+c)                              non-normalized distance, increases with the number of characteristics
Soergel distance                                    (b+c)/(a+b+c)                          normalized distance, complementary to Tanimoto: 1-a/(a+b+c)
Mean Hamming distance                               (b+c)/n = (b+c)/(a+b+c+d)              normalized distance
Mean Euclidean distance                             sqrt((b+c)/n) = sqrt((b+c)/(a+b+c+d))  normalized distance
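
The difference between non-normalized and normalized distances can be illustrated with a short Python sketch. It covers only the Hamming and Euclidean rows of Tab. 2; the function name `tab2_distances` and the counts passed to it are assumptions chosen for the example, not part of the software.

```python
import math


def tab2_distances(a, b, c, d):
    """Hamming and Euclidean distances from Tab. 2 (illustrative sketch)."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("at least one position is required")
    mismatches = b + c
    return {
        "hamming": mismatches,                        # b+c
        "euclidean": math.sqrt(mismatches),           # sqrt(b+c)
        "mean_hamming": mismatches / n,               # (b+c)/n
        "mean_euclidean": math.sqrt(mismatches / n),  # sqrt((b+c)/n)
    }


# Same proportions of matches and mismatches, but twice as many positions
short = tab2_distances(a=2, b=2, c=1, d=3)   # n = 8
longer = tab2_distances(a=4, b=4, c=2, d=6)  # n = 16
print(short["hamming"], longer["hamming"])            # 3 6
print(short["mean_hamming"], longer["mean_hamming"])  # 0.375 0.375
```

The non-normalized Hamming distance doubles when the number of positions doubles, while the mean Hamming distance stays the same.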


Tab. 3. Normalized similarity and correlation measures.
Name                                     Similarity Formula
Jaccard                                  a/(n-d) [= a/(a+b+c)]
Russell & Rao                            a/n
Rogers & Tanimoto                        (a+d)/(a+2*(b+c)+d)
Kulczynski #1                            a/(b+c)
Kulczynski #2                            0.5*(a/(a+b)+a/(a+c))
Dice                                     2*a/(2*a+b+c)
Pearson's Phi coefficient                ((a*d)-(c*b))/sqrt((a+c)*(c+d)*(a+b)*(b+d))
- bc -                                   1 - D
Baroni-Urbani/Buser                      (a+sqrt(a*d))/(a+b+c+sqrt(a*d))
Braun-Blanquet                           if (a+b)>(a+c) then S := a/(a+b) else S := a/(a+c)
Simpson similarity coefficient           if (a+b)<(a+c) then S := a/(a+b) else S := a/(a+c)
Michael                                  4*(a*d-b*c)/((a+d)*(a+d)+(b+c)*(b+c))
Sokal and Sneath #1                      a/(a+2*(b+c))
Sokal and Sneath #2                      0.25*(a/(a+b) + a/(a+c) + d/(b+d) + d/(c+d))
Sokal and Sneath #3                      a*d/sqrt((a+b)*(a+c)*(d+b)*(d+c))
Sokal and Sneath #4                      (a+d)/(b+c)
Sokal and Sneath #5                      2*(a+d)/(2*(a+d)+b+c)
Simple Matching                          (a+d)/(a+b+c+d) = (a+d)/n
Mean Hamming                             1 - D (D = Mean Hamming distance, Tab. 2)
Sneath & Sokal                           (a+d)/(a+0.5*(b+c)+d)
Kocher & Wong                            a*n/((a+b)*(c+d))
Faith                                    (a+d/2)/n
Ochiaï #1                                a/sqrt((a+b)*(a+c))
Ochiaï #2                                a*d/sqrt((a+b)*(a+c)*(d+b)*(d+c))
Q0                                       (b*c)/(a*d)
Yule's Sigma                             (sqrt(a*d)-sqrt(b*c))/(sqrt(a*d)+sqrt(b*c))
Yule's Q                                 (a*d-b*c)/(a*d+b*c)
Upholt                                   F = 2*a/(2*a+b+c)
                                         S = power(0.5*(-F+sqrt(F*F+8*F)), 1/n)
Excoffier                                n*(1-(a/n))
Hamann                                   (a-(b+c)+d)/n
Roux #1                                  (a+d)/(min(b,c)+min(n-b,n-c))
Roux #2                                  (n-a*d)/sqrt((a+b)*(c+d)*(a+c)*(b+d))
Michelet                                 a*a/(b*c)
Fager & McGowan                          a/sqrt((a+b)*(a+c)) + 1/sqrt(a+b)
Fager                                    a/sqrt((a+b)*(a+c)) - max(b,c)
Unigram subtuples                        log(a*d/b/c) - 3.29*sqrt(1/a+1/b+1/c+1/d)
U cost                                   log(1+(min(b,c)+a)/(max(b,c)+a))
S cost                                   log(1+min(b,c)/(a+1))**(-0.5)
R cost                                   log(1+a/(a+b))*log(1+a/(a+c))
T combined cost                          sqrt(U*S*R)
                                         U = log(1+(min(b,c)+a)/(max(b,c)+a))
                                         S = log(1+min(b,c)/(a+1))**(-0.5)
                                         R = log(1+a/(a+b))*log(1+a/(a+c))
McConnaughey                             (a*a-b*c)/sqrt((a+b)*(a+c))
Phi Square                               power(a*d-b*c, 2)/((a+b)*(a+c)*(b+d)*(c+d))
Forbes                                   n*a/((a+b)*(a+c))
Fossum                                   n*(a-0.5)*(a-0.5)/((a+b)*(a+c))
Stiles                                   log10(n*power(abs(a*d-b*c)-n/2, 2)/((a+b)*(a+c)*(b+d)*(c+d)))
Dispersion                               (a*d-b*c)/power(a+b+c+d, 2)
Dennis                                   (a*d-b*c)/sqrt(n*(a+b)*(a+c))
Pearson Chi Square                       n*power(abs(a*d-b*c)-n/2, 2)/((a+b)*(c+d)*(a+c)*(b+d))
Mountford                                2*a/(2*b*c+a*b+a*c)
Mutual Information                       ln(a*n/((a+b)*(a+c)))
Weighted Mutual Information #3           ln(power(a, 3)*n/((a+b)*(a+c)))
Chi Square with correction of Yates n/2  (n*(abs(a*d-b*c)-n/2)**2)/((a+b)*(c+d)*(a+c)*(b+d))
Normalized Collocation                   a/(b+c-a)
Dunning                                  2*(a*log(a) + b*log(b) + c*log(c) + d*log(d))
                                         - (a+b)*log(a+b) - (a+c)*log(a+c) - (b+d)*log(b+d) - (c+d)*log(c+d)
                                         + (a+b+c+d)*log(a+b+c+d)
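
A few of the measures in Tab. 3 can be written out in Python as a rough sketch. Many of the formulas divide by terms that can become zero (for example a+b+c = 0 for Jaccard), and the table does not say how such cases are handled. Returning `None` in that case is purely an assumption of this sketch, as are the names `safe_div` and `similarities`.

```python
import math


def safe_div(num, den):
    """Hypothetical helper: return None when the denominator is zero."""
    return num / den if den != 0 else None


def similarities(a, b, c, d):
    """Selected similarity measures from Tab. 3 (illustrative sketch)."""
    n = a + b + c + d
    return {
        "jaccard": safe_div(a, a + b + c),
        "dice": safe_div(2 * a, 2 * a + b + c),
        "simple_matching": safe_div(a + d, n),
        "phi": safe_div(a * d - b * c,
                        math.sqrt((a + c) * (c + d) * (a + b) * (b + d))),
        # The if/else rules for Braun-Blanquet and Simpson amount to
        # dividing by the larger or the smaller of (a+b) and (a+c).
        "braun_blanquet": safe_div(a, max(a + b, a + c)),
        "simpson": safe_div(a, min(a + b, a + c)),
    }


result = similarities(a=2, b=2, c=1, d=3)
print(result["jaccard"])          # 0.4
print(result["simple_matching"])  # 0.625
print(result["braun_blanquet"])   # 0.5
print(similarities(0, 0, 0, 5)["jaccard"])  # None (a+b+c = 0)
```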

Please note:
The software PAUP (David Swofford, Sinauer Associates, Inc. Publishers, Sunderland, Massachusetts, U.S.A.) calculates a Nei and Li coefficient for restriction sites. This measure is not the same as the Dice distance!