Cluster analysis for phylogeny inference

by Richard A. O'Keefe

Cluster analysis (Wikipedia entry) is all about automatically dividing a collection of objects into clusters. If the clusters form a tree, we have hierarchical clustering; if they form a partition, we have partitional clustering; if they overlap, we have fuzzy clustering.

Since we want trees, the kind of cluster analysis we are interested in is hierarchical clustering. There are two main ways to do that: divisive or top-down methods start at the top and keep on splitting the collection, while agglomerative or bottom-up methods start at the leaves and keep on joining things.

Bottom-up agglomerative clustering

In order to do the bottom-up clustering, you need a tree data type and something to hold the distance matrix.

In C, it really helps if we have compile-time bounds for the number of OTUs and the number of characters. So

#define MAX_OTUS 100
#define MAX_CHRS 200

We need those bounds so that we know how big to make the OTU×character matrix (which we only need while reading the data; what we keep is an OTU×OTU distance matrix). But MAX_OTUS gives us a bound on the maximum tree size, so we could preallocate the space for that. You may prefer to use malloc().

What about the data matrix? Well, for each cluster we shall need to know

  1. the size of the cluster (the number of OTUs it contains)
  2. the tree we've built for it
  3. the distance to each other cluster

There are two ways to represent this: as a record of arrays or as an array of records.

As a record of arrays:

static struct {
    Tree tree[MAX_OTUS];
    int size[MAX_OTUS];
    double dist[MAX_OTUS][MAX_OTUS];
} work;

We don't actually need the struct; we can just have the three separate variables. If you want a Java class, this could be your class. The clustering algorithm would go inside that class.

As an array of records:

struct Work {
    Tree tree;
    int size;
    double dist[MAX_OTUS];
};
static struct Work work[MAX_OTUS];

You could make a Java class out of this, but there really isn't anything much you would have reason to put inside it by way of methods. The array of records approach is better for C than for Java.

How do we compute the distances?

Suppose we have

char data[MAX_CHRS][MAX_OTUS];
int n_chrs; /* number of characters */
int n_OTUs; /* number of OTUs */

Since I have advised you to use the Fitch algorithm for the other methods, and the Fitch algorithm just counts differences, and since I recommended that you treat 'x' as just another value, we might as well use a simple count of differences in this program. So

    for (i = 0; i < n_OTUs; i++) {
        for (j = 0; j < n_OTUs; j++) {
            d = 0;
            for (k = 0; k < n_chrs; k++) {
                d += data[k][i] != data[k][j];
            }
            work.dist[i][j] = d;  /* record of arrays */
            work[i].dist[j] = d;  /* array of records */
        }
    }

The idea is that we start by putting each OTU into its own cluster, so we begin with n_OTUs clusters, each of size 1 and with a one-leaf tree.

    for (i = 0; i < n_OTUs; i++) {
        /* record of arrays version */
        work.size[i] = 1;
        work.tree[i] = new_leaf(i + 1);  /* new_leaf: whatever leaf constructor your Tree type provides */
        /* array of records version */
        work[i].size = 1;
        work[i].tree = new_leaf(i + 1);
    }

We shrink the number of clusters down to 1, maintaining the invariant that the remaining clusters are numbered 0..number_of_clusters-1. This means that we are using the top left corner of the distance matrix. At each iteration of the outer loop, we find the closest pair of clusters, call them a and b. Swap these numbers if necessary so that a < b. We have this diagram:

  0        a          b          L          n_OTUs
 +--------+-+--------+-+--------+-+--------+
0|        | |        | |        | |        |
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 +--------+-+--------+-+--------+-+--------+
a|        | |        | |        | |        |
 +--------+-+--------+-+--------+-+--------+
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 +--------+-+--------+-+--------+-+--------+
b|        | |        | |        | |        |
 +--------+-+--------+-+--------+-+--------+
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 +--------+-+--------+-+--------+-+--------+
L|        | |        | |        | |        |= number of clusters-1
 +--------+-+--------+-+--------+-+--------+
 |        | |        | |        | |        |number of clusters
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 |        | |        | |        | |        |
 +--------+-+--------+-+--------+-+--------+
n_OTUs

We're going to combine row [a] with row [b] to get the distances from the combined ab cluster to all the other clusters. Since the distance matrix is supposed to be symmetric, we'll copy row [a] to column [a]. Row [b] is now dead, so we'll squeeze it out by copying row [L] to row [b], and of course we'll need to copy row [b] to column [b] as well to keep the matrix symmetric.

Finding the minimum distance is easy. The diagonal elements should all be zero, and we don't want to look at those. Since the matrix is symmetric, there's no point in searching the same information twice. So we'll only search over column numbers that are less than the row number.

    for (L = n_OTUs-1; L > 0; L--) {  /* L+1 clusters remain; stop at one */
        c = 1.0e100; /* any really big number */
        for (row = 1; row <= L; row++)
            for (col = 0; col < row; col++)
                if (work[row].dist[col] < c) {
                    a = col;
                    b = row;
                    c = work[row].dist[col];
                }
        /* combine a and b */
    }

What do we do to combine a and b?

  1. Adjust the distances
  2. Increase the size of a
  3. Build a new tree node
  4. Move row [L] to row [b]

You may choose either one of the two formulas:

    work.dist[a][i] = (work.dist[a][i] + work.dist[b][i])/2

or

    work.dist[a][i] = ( work.dist[a][i] * work.size[a]
                      + work.dist[b][i] * work.size[b]
                      ) / (work.size[a] + work.size[b])

There are many other formulas. For example, "Single Linkage" uses

    work.dist[a][i] = min(work.dist[a][i], work.dist[b][i])

and "Complete Linkage" uses

    work.dist[a][i] = max(work.dist[a][i], work.dist[b][i])

But you should use one of the two formulas I listed first. Remember that you want to iterate from i=0 to i=L inclusive. After calculating these distances, you will want to copy row [a] to column [a]:

    for (row = 0; row <= L; row++)
        work.dist[row][a] = work.dist[a][row];

To increase the size of a, just add work[b].size to work[a].size.

To build a new tree node, just allocate an internal node, make work[a].tree and work[b].tree its children, and assign it to work[a].tree.

Finally, we're about to lose row/column L, so we copy that to the dead row/column b, which you can figure out.

An important note about this is that as I've presented it, the cost of building a tree given a distance matrix is O(n_OTUs**3). There is a simple way of fixing that. (About 20 lines of my program are devoted to initialising and preserving the auxiliary data for this.) For each row of the matrix, you keep track of where its smallest element is. Finding the smallest element in the whole matrix then costs O(n_OTUs). Finding the smallest element in the new combined row is clearly O(n_OTUs). For all the other rows, the smallest element is either the previous smallest element or the new combined element, so that's O(1) per row, or O(n_OTUs) altogether. This brings the cost of finding the smallest distance to O(n_OTUs) per iteration instead of O(n_OTUs**2).

The cost of reading the character data and building the distance matrix is O(n_OTUs**2 * n_chrs). But there are more characters than OTUs in this problem, so we can expect that the bulk of the time will go in just reading and building the distance matrix even without this optimisation. In fact, my model answer has three parts: (1) reading the data and building the distance matrix, (2) building the tree, and (3) printing the tree. The one that takes the least time for this data set (using 'cc -p' and 'prof' to find out) is (2). So for a problem of this size there's no need to worry about the cubic cost.