COSC348: Inferring Phylogenies 5

Distance methods (clustering)

Where we have got so far

The parsimony-based methods we have explored are

How can we cope with large numbers of species?

Embrace approximation

The Fitch algorithm for scoring trees is an approximation. The Sankoff algorithm for scoring trees is an approximation. Optimising either score is only an approximation to "real" parsimony, whatever that might be.

An earlier approach to classification used an even cruder approximation: work from just a matrix of pairwise similarities, and ignore all the fine detail of how organisms differ in favour of how much they differ.

We are thinking of OTUs as points in an abstract space, with some kind of distance between them. OTUs that are close in this space should be more closely related than OTUs that are distance in this space.

So what do we mean by "distance"?

A metric on some space is a function d(x,y) such that

  1. [Non-negative] d(x,y) >= 0 for all x, y
  2. [Distance and identity] d(x,y) = 0 if and only if x = y
  3. [Symmetry] d(x,y) = d(y,x)
  4. [Triangle inequality] d(x,y) + d(y,z) >= d(x,z) for all x, y, z

We can live without the triangle inequality if we have to, but the other things are indispensible.

How does this relate to the Fitch measure, for example? Imagine a node with two children, with their sets calculated and their scores computed.

  1. The Fitch score is clearly non-negative.
  2. The Fitch score of a 3-node tree (OTU_1,OTU_2) is zero if and only if the two OTUs have the same characters. They might not be identical, but on the basis of the measured characters we cannot tell them apart. We should ensure that there are no such pairs in our data; if we find that we cannot distinguish two OTUs, it's time to pick another character that does distinguish them and measure that. As long as we do that, this requirement will be satisfied.
  3. The Fitch score of (a,b) is clearly identical to that of (b,a).
  4. The triangle inequality does hold for triples of OTUs because it's then a simple count of differences.

Clustering

The basic algorithm goes like this:

# set up the distance matrix #

for i from 1 to N_OTUs do
    for j from i+1 to N_OTUs do
	dist[i][j] := d(OTU i, OTU j)
	dist[j][i] := dist[i][j]
    od
    dist[i][i] := ?
od

# set up the array of trees #

for i from 1 to N_OTUs do
    work[i] := a leaf referring to OTU i
od

# while more than one tree remains, keep on picking the closest pair and forming a new tree from them, taking care to update the distance matrix #

for remaining from N_OTUs to 2 by -1 do
    best_d, best_i, best_j := HUGE, ?, ?
    for i from 1 to N_OTUs - 1 do
        for j from i+1 to N_OTUs do
            if dist[i][j] < best_d then
                best_d, best_i, best_j := dist[i][j], i, j
	    fi
	od
    od
    <<update the distances>>
    work[i] := a new internal node with left child work[i]
                                   and right child work[j]
    work[j] := work[remaining]
od
return work[1]

But how do we update the distances?

One approach would be to store in each node a description of the set of OTUs it represents. When you make a new internal node, you combine the descriptions in work[i] and work[j] to make the description for the new node. (You could use either the Fitch or the Sankoff algorithm for that.) I've implemented that, and it works quite well, but, surprisingly, it didn't in this case work any better than a simple scheme which computes new distances from the old distances.

<<update the distances>>

ni, nj := size(work[i]), size(work[j])
for k from 1 to remaining do
    dist[i][k] := f(ni, dist[i,k], nj, dist[j,k])
od
for k from 1 to remaining do
    dist[k][i] := dist[i][k]
od

OK, so what's f? There is a general parametric family of clustering methods, but some commonly used ones are
Single linkage f(_,x,_,y) = min(x,y)
Complete linkagef(_,x,_,y) = max(x,y)
Average linkage f(m,x,n,y) = (m×x+n×y)/(m+n)
f(_,x,_,y) = (x+y)/2

The last of these is the one I called "Neighbour Joining", and the one shown here as "Average linkage" (a common Statistical name for it) is the one I called UPGMA (rightly or wrongly).

How much does all this cost? Let N be the number of OTUs.

Clearly, the total is O(N3). We couldn't possibly do better than O(N2) because we have to look at N(N-1)/2 distances, but it's a pity to take cubic time.

By keeping track of the location of the best distance in each row, we can reduce the cost of finding the best pair to O(N), and the whole cost to O(N2). You will find details in the model answer when it is available.

However, in our case, calculating a distance is O(K) where K is the number of characters, so it costs O(N.N.K) time to build the distance matrix and O(N.N.N) time to build a tree from it without any cleverness at all. N=30, and K=85, so the initial distance matrix calculation is actually the most costly part and there is no point in you trying to be clever in your assignment.

Conclusion

Clustering methods