The parsimony-based methods we have explored are
Note that using any old haphazard method is not the same as generating all trees with equal likelihood.
We can generate lots of trees very fast, but are not at all likely to find a good one.
This is almost as simple as generating random trees. It is quite fast, and finds rather good trees. It is not guaranteed to find the best.
We have to use random restarts to try and avoid local minima.
This is rather more complicated than the greedy method, but is even better at finding good solutions.
It still isn't guaranteed to find the best, and we still need random restarts.
This and exhaustive search are the only methods we considered that are guaranteed to find optimal solutions.
The price of this is exponential worst case cost. And it's not just a theoretical possibility. The method is quite practical for 14 OTUs, even 15 if you are patient, but it's quite impractical for 30 OTUs.
For a problem that the method can handle, we often find that there are many solutions all equally good, which do not agree about whether two OTUs are similar because they share a common innovation, or because they share a primitive feature that other OTUs have abandoned.
How can we cope with large numbers of species?
The Fitch algorithm for scoring trees is an approximation. The Sankoff algorithm for scoring trees is an approximation. Optimising either score is only an approximation to "real" parsimony, whatever that might be.
An earlier approach to classification used an even cruder approximation: work from just a matrix of pairwise similarities, and ignore all the fine detail of how organisms differ in favour of how much they differ.
We are thinking of OTUs as points in an abstract space, with some kind of distance between them. OTUs that are close in this space should be more closely related than OTUs that are distance in this space.
So what do we mean by "distance"?
A metric on some space is a function d(x,y) such that
We can live without the triangle inequality if we have to, but the other things are indispensible.
How does this relate to the Fitch measure, for example? Imagine a node with two children, with their sets calculated and their scores computed.
The basic algorithm goes like this:
# set up the distance matrix #
for i from 1 to N_OTUs do
for j from i+1 to N_OTUs do
dist[i][j] := d(OTU i, OTU j)
dist[j][i] := dist[i][j]
od
dist[i][i] := ?
od
# set up the array of trees #
for i from 1 to N_OTUs do
work[i] := a leaf referring to OTU i
od
# while more than one tree remains, keep on picking the closest pair and forming a new tree from them, taking care to update the distance matrix #
for remaining from N_OTUs to 2 by -1 do
best_d, best_i, best_j := HUGE, ?, ?
for i from 1 to N_OTUs - 1 do
for j from i+1 to N_OTUs do
if dist[i][j] < best_d then
best_d, best_i, best_j := dist[i][j], i, j
fi
od
od
<<update the distances>>
work[i] := a new internal node with left child work[i]
and right child work[j]
work[j] := work[remaining]
od
return work[1]
But how do we update the distances?
One approach would be to store in each node a description of the set of OTUs it represents. When you make a new internal node, you combine the descriptions in work[i] and work[j] to make the description for the new node. (You could use either the Fitch or the Sankoff algorithm for that.) I've implemented that, and it works quite well, but, surprisingly, it didn't in this case work any better than a simple scheme which computes new distances from the old distances.
<<update the distances>>
ni, nj := size(work[i]), size(work[j])
for k from 1 to remaining do
dist[i][k] := f(ni, dist[i,k], nj, dist[j,k])
od
for k from 1 to remaining do
dist[k][i] := dist[i][k]
od
OK, so what's f? There is a general parametric family
of clustering methods, but some commonly used ones are
| Single linkage | f(_,x,_,y) = min(x,y) |
| Complete linkage | f(_,x,_,y) = max(x,y) |
| Average linkage | f(m,x,n,y) = (m×x+n×y)/(m+n) |
| f(_,x,_,y) = (x+y)/2 |
The last of these is the one I called "Neighbour Joining", and the one shown here as "Average linkage" (a common Statistical name for it) is the one I called UPGMA (rightly or wrongly).
How much does all this cost? Let N be the number of OTUs.
Clearly, the total is O(N3). We couldn't possibly do better than O(N2) because we have to look at N(N-1)/2 distances, but it's a pity to take cubic time.
By keeping track of the location of the best distance in each row, we can reduce the cost of finding the best pair to O(N), and the whole cost to O(N2). You will find details in the model answer when it is available.
However, in our case, calculating a distance is O(K) where K is the number of characters, so it costs O(N.N.K) time to build the distance matrix and O(N.N.N) time to build a tree from it without any cleverness at all. N=30, and K=85, so the initial distance matrix calculation is actually the most costly part and there is no point in you trying to be clever in your assignment.
Clustering methods