Distance Measures


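To calculate similarities or dissimilarities (distances) between two binary samples, the positions are counted that are present in both samples (a, total matches), present in only one of the samples (b, c, single matches), and absent from both samples (d, no matches). Together they make up the total number of positions (n = a+b+c+d).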

> Tab. 1 shows how these variables are defined.

> Examples for non-normalized or normalized distance measures are given in Tab. 2.

> Formulas for several distance or similarity measures are given in Tab. 3.

Tab. 1. Variables used to calculate binary similarities/dissimilarities.

Sample	Character	Sample #2 = 1	Sample #2 = 0
Sample #1	1	a = 1/1 (in both samples)	b = 1/0 (only in sample #1)
Sample #1	0	c = 0/1 (only in sample #2)	d = 0/0 (in neither sample)
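
The counting defined in Tab. 1 can be sketched in a few lines of Python. This is only an illustration, assuming each sample is given as a sequence of 0/1 character states of equal length; the function name `count_matches` and the two example samples are hypothetical and not part of the software.

```python
from typing import Sequence, Tuple


def count_matches(s1: Sequence[int], s2: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two equal-length 0/1 sequences (see Tab. 1)."""
    if len(s1) != len(s2):
        raise ValueError("samples must have the same number of positions")
    a = b = c = d = 0
    for x, y in zip(s1, s2):
        if x and y:
            a += 1  # 1/1: present in both samples
        elif x:
            b += 1  # 1/0: only in sample #1
        elif y:
            c += 1  # 0/1: only in sample #2
        else:
            d += 1  # 0/0: in neither sample
    return a, b, c, d


# Hypothetical example data with eight positions
sample1 = [1, 1, 0, 1, 0, 0, 1, 0]
sample2 = [1, 0, 0, 1, 1, 0, 1, 0]

a, b, c, d = count_matches(sample1, sample2)
n = a + b + c + d
print(a, b, c, d, n)  # 3 1 1 3 8
```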

Tab. 2. Examples for distances.

Name	Formula	Comment
Hamming distance (Manhattan, city-block, taxi-cab)	b+c	*non-normalized* distance, increases with the number of characteristics
Euclidean distance	sqrt(b+c)	*non-normalized* distance, increases with the number of characteristics
Soergel distance	(b+c)/(a+b+c)	*normalized* distance, complementary to Tanimoto (Jaccard): 1-a/(a+b+c)
Mean Hamming distance	(b+c)/n = (b+c)/(a+b+c+d)	*normalized* distance
Mean Euclidean distance	sqrt((b+c)/n) = sqrt((b+c)/(a+b+c+d))	*normalized* distance
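
The difference between non-normalized and normalized distances can be illustrated with a short Python sketch. It assumes the counts a, b, c, d from Tab. 1 are already known and that n > 0; the function names are hypothetical. Doubling all counts doubles the Hamming distance but leaves the mean Hamming distance unchanged.

```python
import math


def hamming(a, b, c, d):
    """Non-normalized: number of mismatching positions."""
    return b + c


def euclidean(a, b, c, d):
    """Non-normalized: square root of the number of mismatches."""
    return math.sqrt(b + c)


def mean_hamming(a, b, c, d):
    """Normalized: fraction of mismatching positions (assumes n > 0)."""
    n = a + b + c + d
    return (b + c) / n


def mean_euclidean(a, b, c, d):
    """Normalized: square root of the mismatch fraction (assumes n > 0)."""
    n = a + b + c + d
    return math.sqrt((b + c) / n)


small = (3, 1, 1, 3)                  # hypothetical counts, n = 8
large = tuple(2 * x for x in small)   # same proportions, n = 16

for counts in (small, large):
    print(counts, hamming(*counts), mean_hamming(*counts))
# (3, 1, 1, 3) 2 0.25
# (6, 2, 2, 6) 4 0.25
```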

Tab. 3. Normalized similarity and correlation measures.

Name	Similarity Formula
Jaccard	a/(n-d) [= a / (a+b+c)]
Russell & Rao	a/n
Rogers & Tanimoto	(a+d)/(a+2*(b+c)+d)
Kulczynski #1	a/(b+c)
Kulczynski #2	0.5*(a/(a+b)+a/(a+c))
Dice	2*a/(2*a+b+c)
Pearson's Phi coefficient	((a*d)-(c*b))/sqrt((a+c)*(c+d)*(a+b)*(b+d))
- bc -	1 - D
Baroni-Urbani/Buser	(a+sqrt(a*d))/(a+b+c+sqrt(a*d))
Braun-Blanquet	if (a+b)>(a+c) then S:=a/(a+b) else S:=a/(a+c)
Simpson similarity coefficient	if (a+b)<(a+c) then S := a/(a+b) else S := a/(a+c)
Michael	4*(a*d-b*c)/((a+d)*(a+d)+(b+c)*(b+c))
Sokal and Sneath #1	a/(a+2*(b+c))
Sokal and Sneath #2	0.25 * ( a/(a+b) + a/(a+c) + d/(b+d) + d/(c+d) )
Sokal and Sneath #3	a*d/sqrt((a+b)*(a+c)*(d+b)*(d+c))
Sokal and Sneath #4	(a+d)/(b+c)
Sokal and Sneath #5	2*(a+d)/(2*(a+d)+b+c)
Simple Matching	(a+d)/(a+b+c+d) =(a+d)/n
Mean Hamming	1 - D, with D = (b+c)/n (mean Hamming distance, see Tab. 2)
Sneath & Sokal	(a+d)/(a +0.5*(b+c)+d)
Kocher & Wong	a*n/((a+b)*(c+d))
Faith	(a+d/2)/n
Ochiaï #1	a/sqrt((a+b)*(a+c))
Ochiaï #2	a*d/sqrt((a+b)*(a+c)*(d+b)*(d+c))
Q0	(b*c)/(a*d)
Yule's Sigma	(sqrt(a*d)-sqrt(b*c))/(sqrt(a*d)+sqrt(b*c))
Yule's Q	(a*d-b*c)/(a*d+b*c)
Upholt	F = 2*a/(2*a+b+c); S = power(0.5*(-F+sqrt(F*F+8*F)), 1/n)
Excoffier	n*(1-(a/n))
Hamann	(a-(b+c)+d)/n
Roux #1	(a+d) / (min(b,c)+min(n-b,n-c))
Roux #2	(n-a*d) / sqrt((a+b)*(c+d)*(a+c)*(b+d))
Michelet	a*a/(b*c)
Fager & McGowan	a/sqrt((a+b)*(a+c)) + 1/sqrt(a+b)
Fager	a/sqrt((a+b)*(a+c)) - max(b,c)
Unigram subtuples	log(a*d/b/c) - 3.29*sqrt(1/a+1/b+1/c+1/d)
U cost	log(1+(min(b,c)+a)/(max(b,c)+a))
S cost	log(1+min(b,c)/(a+1))**(-0.5)
R cost	log(1+a/(a+b))*log(1+a/(a+c))
T combined cost	sqrt(U*S*R), with U = log(1+(min(b,c)+a)/(max(b,c)+a)), S = log(1+min(b,c)/(a+1))**(-0.5), R = log(1+a/(a+b))*log(1+a/(a+c))
McConnaughey	(a*a - b*c) / sqrt((a+b)*(a+c))
Phi Square	power((a*d - b*c), 2) / ((a+b)*(a+c)*(b+d)*(c+d))
Forbes	n*a/((a+b)*(a+c))
Fossum	n*(a-0.5)*(a-0.5)/((a+b)*(a+c))
Stiles	log10(n*power(abs(a*d-b*c)-n/2, 2) / ((a+b)*(a+c)*(b+d)*(c+d)))
Dispersion	(a*d-b*c)/power(a+b+c+d, 2)
Dennis	(a*d-b*c)/sqrt(n*(a+b)*(a+c))
Pearson Chi Square	n * power(abs(a*d-b*c)-n/2, 2)/((a+b)*(c+d)*(a+c)*(b+d))
Mountford	2*a/(2*b*c+a*b+a*c)
Mutual Information	ln(a*n/((a+b)*(a+c)))
Weighted Mutual Information #3	ln(power(a, 3)*n/((a+b)*(a+c)))
Chi Square with correction of Yates n/2	(n * (abs(a*d-b*c)-n/2)**2)/((a+b)*(c+d)*(a+c)*(b+d))
Normalized Collocation	a/(b+c-a)
Dunning	2*(a*log(a) + b*log(b) + c*log(c) + d*log(d)) - (a+b)*log(a+b) - (a+c)*log(a+c) - (b+d)*log(b+d) - (c+d)*log(c+d) + (a+b+c+d)*log(a+b+c+d)
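
A few of the measures in Tab. 3 are sketched below in Python for illustration. The sketch assumes the counts a, b, c, d from Tab. 1 are given; the helper names `ratio` and `similarities` are hypothetical, and returning NaN for a zero denominator is an assumption of this example, not a statement about how the software handles that case. The if/else rules for Braun-Blanquet and Simpson are written with `max` and `min`, which gives the same result.

```python
import math


def ratio(num, den):
    """Divide num by den; return NaN when the denominator is zero (assumption)."""
    return num / den if den != 0 else float("nan")


def similarities(a, b, c, d):
    """Compute a small selection of the measures listed in Tab. 3."""
    n = a + b + c + d
    return {
        "Jaccard": ratio(a, a + b + c),
        "Dice": ratio(2 * a, 2 * a + b + c),
        "Simple Matching": ratio(a + d, n),
        "Hamann": ratio(a - (b + c) + d, n),
        # "if (a+b)>(a+c) then a/(a+b) else a/(a+c)": divide by the larger total
        "Braun-Blanquet": ratio(a, max(a + b, a + c)),
        # "if (a+b)<(a+c) then a/(a+b) else a/(a+c)": divide by the smaller total
        "Simpson": ratio(a, min(a + b, a + c)),
        "Yule's Q": ratio(a * d - b * c, a * d + b * c),
        "Pearson's Phi": ratio(
            a * d - b * c,
            math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
        ),
    }


# Hypothetical counts: a = 3, b = 1, c = 1, d = 3 (n = 8)
result = similarities(3, 1, 1, 3)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these counts, Jaccard is 0.6, Dice and Simple Matching are 0.75, and Yule's Q is 0.8. Measures that use d (such as Simple Matching) count shared absences as agreement, whereas Jaccard and Dice ignore them.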

Please note:

The software PAUP (David Swofford, Sinauer Associates, Inc. Publishers, Sunderland, Massachusetts, U.S.A.) calculates a Nei and Li coefficient for restriction sites. This measure is not the same as the Dice distance!