Phylogeny inference by genetic algorithm

by Richard A. O'Keefe

Imagine the task was to "write a program that uses the Genetic Algorithms (Wikipedia entry) technique to find a phylogenetic tree."

Genetic Algorithms in a Nutshell

Genetic algorithms are optimisation algorithms which search for a solution with high fitness. For a maximisation problem, we might take fitness to be the function value. For a minimisation problem, we might take fitness to be the negative of the value with some constant added. The algorithm works with a population (usually of a fixed size, typically in the hundreds) of chromosomes. A chromosome is a fixed size string which somehow encodes a solution. (Genetic Programming works with trees rather than strings, but my aim in formulating this assignment was to avoid any tree manipulation you might find difficult.)

Processing typically starts with a random population, and continues for a number of generations. Each generation involves fitness-based selection, cross-over and mutation.

Fitness-based selection means that we convert the fitness of each chromosome in the population to a non-negative number. You can do this in any way whatsoever that you like as long as the worst possible score is mapped to 0 and better scores are mapped to bigger numbers. One approach is to use a linear mapping. This mapping could be based on known bounds for the scores (the smallest possible Fitch score for a phylogeny is 0, the highest possible score is what?) or it could be relative to the current population, which is often a good idea to start with because the initial population tend not to be very good. Then you scale these numbers so that they sum to 1. Whenever we want to select a member of the population, we do so in such a way that the probability of selecting a chromosome is its scaled fitness.

In this assignment, your raw scores are the Fitch scores, which you want to minimise. So a small Fitch score must be mapped to a large positive number and a large Fitch score must be mapped to a small positive number. We normally don't want any chromosomes to be given zero fitness, unless they aren't solutions at all.

Cross-over means that we choose two chromosomes from the population (using fitness-based selection), pick a random position within the chromosomes (so if a chromosome has g genes, we pick a number c in the range 0..g, then we take the leftmost c genes from one chromosome and the rightmost g-c from the other.

Mutation means that with some low probability we pick a gene in a chromosome and change it. We need a constant trickle of mutations to ensure that each possible chromosome can be reached, but we don't need a lot because most of the work is done by cross-over. A chance of 1 in a hundred for a mutation is a good starting point. The best probability depends on the problem.

A technique called elitism is often used; at each generation the best chromosome is copied into the next generation whether it is selected or not. Sometimes more than one top chromosome is chosen. The idea is to ensure that really good solutions are never lost, and the best result at the end is the best result of any generation.

Finally we report the best result in the last generation. The number of generations is typically in the hundreds to thousands.

Here is the outline of a GA:

let NPop be the chosen population size.
let NGen be the number of generations to run for.
let Pop[0..NPop-1] be an array of chromosomes.

for i from 0 to NPop-1:
    set Pop[i] to a random chromosome.

for g from 1 to NGen:
    let Pop'[0..NPop-1] be an array of chromosomes.

    for i from 0 to NPop-1:
	compute the raw fitness of Pop[i].
    set Pop'[0] to the best element of Pop.
    when debugging, it may be useful to report the fitness of this.

    compute the information you need for fitness-based selection
    from the raw fitnesses.

    for i from 1 to NPop-1:
	choose p1 from Pop[] at fitness-based random.
	choose p2 from Pop[] at fitness-based random.
	let ch be cross-over(p1, p2).
	with a small fixed probability:
	    mutate ch.
	set Pop'[i] to ch.

    set Pop to Pop'.

Let r be the best element of Pop[].
Report the raw fitness of r.
Report the solution represented by r.

While the total fitness of each generation is rescaled to sum to 1, that does not apply to the raw fitnesses (the function value that we are trying to improve). In our case, they are Fitch scores which will almost always be much bigger than 1 and we are trying to reduce them. With any luck, we'll see the raw fitness of the best chromosome get better from generation to generation.

Applying GAs to Phylogeny.

To apply GAs in any problem, we need

A way to represent a solution as a chromosome.
A way to determine the fitness of a chromosome.
A way to do cross-over on a chromosome.
A way to mutate a chromosome.

When designing a chromosome representation, we want to be sure that

if you cross over two well formed chromosomes, the result is a well formed chromosome.
if you mutate a chromosome, you only have to make a local change to it, you don't need to "repair" the rest of the chromosome. So it has to be possible to choose a new valid gene without caring about what the rest of the chromosome looks like.

In order to make this assignment easy, I'll tell you how we might do this. There are other ways. What I want is a situation where the only operations you need on trees are

Create a tree with one OTU.
Given two trees, create a new HTU node with the given trees as its children.
Run the Fitch algorithm on a tree to find the number of changes it implies. (This score ranges from 0 to some positive integer; what is the highest possible score?)

In effect, our chromosomes will be little programs for making trees. For a problem with n OTUs, we'll need to make n-1 HTUs. Imagine making an array t[0]...t[n-1] of single-OTU trees, where n is the number of OTUs that you have been given. The chromosome is a sequence of n-2 pairs (x_i,y_i) where x_i<y_i. There is an implicit (0,1) pair at the end. We execute the program

top = n-1;
for (i = 0; i < n-2; i++) {
    int p = pair[i].x; Tree a = t[p];
    int q = pair[i].y; Tree b = t[q];

    t[p] = new_HTU(a, b);
    t[q] = t[top];
    top--;
}
t[0] = new_HTU(t[0], t[1]);

At the ith stage, we have n-i trees left (top == n-1-i), so a well formed pair (x_i,y_i) must actually satisfy
0 <= x_i < y_i < n-i.

For example, suppose we start with

// 0 1 2 3 4 5
  {a,b,c,d,e,f}

and the chromosome {(0,3),(3,4),(1,2),(0,2)}

      //  0     1 2 3 4
(0,3) => {(a,d),b,c,f,e}
      //  0     1 2 3
(3,4) => {(a,d),b,c,(f,e)}
      //  0     1     2
(1,2) => {(a,d),(b,c),(f,e)}
      //  0             1
(0,2) => {((a,d),(f,e)),(b,c)}
      //  0
(0,1) => {(((a,d),(f,e)),(b,c))}

where the last join is due to the implicit (0,1) at the end of the chromosome.

Your tasks

What I suggest you do in this lab is work on some of the things that you know you will need for your assignment.

A data structure for trees. Leaves need to include the OTU number. Internal nodes need to know their children. This data structure will be used in your program for just two things: temporarily, as a step in calculating the fitness of a chromosome, and finally, to display the best tree you found.
A version of the Fitch algorithm that computes the score of a tree. (Use the OTU number stored in a leaf to find the character data for that leaf. Store the character data in an array, which will never change once it is initialised.)
A data structure that represents a chromosome. It is basically an array of numbers, but you will find it handy to record the actual fitness (the Fitch score) and the scaled fitness (0 to 1 with high being good) as well.
A function to generate a random chromosome.
A function to cross two chromosomes over and produce a new chromosome.
A function to mutate a chromosome in place. (In effect, to generate a random chromosome we simply mutate each gene, so you will use the same building block here and in chromosome generation.)
A function to compute the scaled fitness for a population.
Code for fitness-proportionate selection.
Maybe something to read the Caminalcule data in, or you could just compile it in the way I did with C.

Hint: the easiest way to generate a random pair 0<=x<y<u is to generate two integers in the range 0..u-1, sort them, and try again if they are equal.

Hint: to generate an integer in the range 0..u-1 in C,

    #include <stdio.h>
    ...
        x = (int)(drand48() * u);
    ...

or possibly

    #include <stdio.h>
    ...
	x = (int)((random() / 2147483647.0) * u);
    ...

Do not use rand().

In Java, use Math.random().

Hint: for doing binary search in Java, you can use java.util.Arrays.binary_search() although you will need a small amount of code to fix up a negative result. In C there is a standard bsearch() function, but it is probably easier to write your own.

Hint: doing fitness-proportionate selection is tricky. It can be done in linear time if you are really cunning. The easiest method I know is O(n.lgn), which is good enough.

0.  Start by getting all scores non-negative.
1.  sum := 0.0
    for each chromosome in the population:
        sum := sum + score of that chromosome
2.  cum := 0
    for each chromosome in the population:
        scaled fitness of that chromosome := cum / sum
        cum := cum + score of that chromosome

Now we have a sequence of things labelled by a strictly increasing sequence of floating point numbers. For example,

    (24,a) (12,b) (9,c) (18,d) (31,e) (20,f)
=>
    sum = 114.0
=>  (0.0,a) (0.21,b) (0.32,c) (0.39,d) (0.55,e) (0.82,f)

Another way of saying this is that each item is now associated with an interval:

    a : [0.00,0.21)
    b : [0.21,0.32)
    c : [0.32,0.39)
    d : [0.39,0.55)
    e : [0.55,0.82)
    f : [0.82,1.00)

Now, whenever we want to choose something from this collection,

3.  Generate a random number 0 <= u < 1.
4.  Use binary search to find the pair whose associated interval
    includes u.

For example, if we generated 0.5, we would choose d.

No guarantees

Genetic algorithms often work very well, but only when you have a suitable chromosome representation and good values for parameters like population size, number of generations, probability of mutation, and so on. The particular representation given here has not been demonstrated to work well. Nor has it been demonstrated to work badly. The most I claim for it is that it makes cross-over and mutation easy and leads to very very simple tree-handling code, and you will get some kind of answer out of this. Because you are dealing with random numbers you most likely won't get the same answer as each other, and some of you will have better ones and some worse. That does not matter. The point is to get some kind of answer using an interesting technique that isn't very complicated.