COSC348: Inferring Phylogenies 5

Distance methods (clustering)

Where we have got so far

The parsimony-based methods we have explored are

Generate trees at random.
Note that using any old haphazard method is not the same as generating all trees with equal likelihood.
We can generate lots of trees very fast, but are not at all likely to find a good one.
Use a greedy method.
This is almost as simple as generating random trees. It is quite fast, and finds rather good trees. It is not guaranteed to find the best.
We have to use random restarts to try and avoid local minima.
Use a hill-climbing method.
This is rather more complicated than the greedy method, but is even better at finding good solutions.
It still isn't guaranteed to find the best, and we still need random restarts.
Use a branch-and-bound algorithm to find the optimum.
This and exhaustive search are the only methods we considered that are guaranteed to find optimal solutions.
The price of this is exponential worst case cost. And it's not just a theoretical possibility. The method is quite practical for 14 OTUs, even 15 if you are patient, but it's quite impractical for 30 OTUs.
For a problem that the method can handle, we often find that there are many solutions all equally good, which do not agree about whether two OTUs are similar because they share a common innovation, or because they share a primitive feature that other OTUs have abandoned.

How can we cope with large numbers of species?

Embrace approximation

The Fitch algorithm for scoring trees is an approximation. The Sankoff algorithm for scoring trees is an approximation. Optimising either score is only an approximation to "real" parsimony, whatever that might be.

An earlier approach to classification used an even cruder approximation: work from just a matrix of pairwise similarities, and ignore all the fine detail of how organisms differ in favour of how much they differ.

We are thinking of OTUs as points in an abstract space, with some kind of distance between them. OTUs that are close in this space should be more closely related than OTUs that are distance in this space.

So what do we mean by "distance"?

A metric on some space is a function d(x,y) such that

[Non-negative] d(x,y) >= 0 for all x, y
[Distance and identity] d(x,y) = 0 if and only if x = y
[Symmetry] d(x,y) = d(y,x)
[Triangle inequality] d(x,y) + d(y,z) >= d(x,z) for all x, y, z

We can live without the triangle inequality if we have to, but the other things are indispensible.

How does this relate to the Fitch measure, for example? Imagine a node with two children, with their sets calculated and their scores computed.

The Fitch score is clearly non-negative.
The Fitch score of a 3-node tree (OTU_1,OTU_2) is zero if and only if the two OTUs have the same characters. They might not be identical, but on the basis of the measured characters we cannot tell them apart. We should ensure that there are no such pairs in our data; if we find that we cannot distinguish two OTUs, it's time to pick another character that does distinguish them and measure that. As long as we do that, this requirement will be satisfied.
The Fitch score of (a,b) is clearly identical to that of (b,a).
The triangle inequality does hold for triples of OTUs because it's then a simple count of differences.

Clustering

The basic algorithm goes like this:

# set up the distance matrix #

for i from 1 to N_OTUs do
    for j from i+1 to N_OTUs do
	dist[i][j] := d(OTU i, OTU j)
	dist[j][i] := dist[i][j]
    od
    dist[i][i] := ?
od

# set up the array of trees #

for i from 1 to N_OTUs do
    work[i] := a leaf referring to OTU i
od

# while more than one tree remains, keep on picking the closest pair and forming a new tree from them, taking care to update the distance matrix #

for remaining from N_OTUs to 2 by -1 do
    best_d, best_i, best_j := HUGE, ?, ?
    for i from 1 to N_OTUs - 1 do
        for j from i+1 to N_OTUs do
            if dist[i][j] < best_d then
                best_d, best_i, best_j := dist[i][j], i, j
	    fi
	od
    od
    <<update the distances>>
    work[i] := a new internal node with left child work[i]
                                   and right child work[j]
    work[j] := work[remaining]
od
return work[1]

But how do we update the distances?

One approach would be to store in each node a description of the set of OTUs it represents. When you make a new internal node, you combine the descriptions in work[i] and work[j] to make the description for the new node. (You could use either the Fitch or the Sankoff algorithm for that.) I've implemented that, and it works quite well, but, surprisingly, it didn't in this case work any better than a simple scheme which computes new distances from the old distances.

<<update the distances>>

ni, nj := size(work[i]), size(work[j])
for k from 1 to remaining do
    dist[i][k] := f(ni, dist[i,k], nj, dist[j,k])
od
for k from 1 to remaining do
    dist[k][i] := dist[i][k]
od

OK, so what's f? There is a general parametric family of clustering methods, but some commonly used ones are

Single linkage f(_,x,_,y) = min(x,y)
Complete linkage f(_,x,_,y) = max(x,y)
Average linkage f(m,x,n,y) = (m×x+n×y)/(m+n)
f(_,x,_,y) = (x+y)/2

The last of these is the one I called "Neighbour Joining", and the one shown here as "Average linkage" (a common Statistical name for it) is the one I called UPGMA (rightly or wrongly).

How much does all this cost? Let N be the number of OTUs.

Setting up the distance matrix costs O(N²) calls to the distance calculation.
Setting up the array of trees (initially all leaves) is O(N).
Finding the smallest distance takes O(N²) time, and it's inside a loop that goes around N-1 times, so that's O(N³) time.
Updating the distance matrix takes O(N) time and is inside a loop that goes around N-1 times, so that's O(N²).

Clearly, the total is O(N³). We couldn't possibly do better than O(N²) because we have to look at N(N-1)/2 distances, but it's a pity to take cubic time.

By keeping track of the location of the best distance in each row, we can reduce the cost of finding the best pair to O(N), and the whole cost to O(N²). You will find details in the model answer when it is available.

However, in our case, calculating a distance is O(K) where K is the number of characters, so it costs O(N.N.K) time to build the distance matrix and O(N.N.N) time to build a tree from it without any cleverness at all. N=30, and K=85, so the initial distance matrix calculation is actually the most costly part and there is no point in you trying to be clever in your assignment.

Conclusion

Clustering methods

are polynomial time, not exponential
are usable for thousands of OTUs
are quite fast in practice
generate reasonably results
do not generate most parsimonious results and don't even try to.

Single linkage	f(_,x,_,y) = min(x,y)
Complete linkage	f(_,x,_,y) = max(x,y)
Average linkage	f(m,x,n,y) = (m×x+n×y)/(m+n)
	f(_,x,_,y) = (x+y)/2