# COSC345 Assignment 4 — k-Means S

## Student number:

Suppose we have measured a collection of “cases” that we would like to divide into k groups, so that each group is pretty much the same, and the groups are different. Also suppose that we have measured some numeric “features” for each of thes cases, and that we define two cases to be “similar” if the Euclidean distance between their feature vectors is small.

The k-means algorithm is a famous algorithm for doing this. You have to find your cases and measure them, and choose k, but the algorithm takes it from there.

An outer iteration goes

• choose k cases at random to be cluster centres
• improve these centres

and we repeat that several times and take the best result.

An inner iteration goes

• associate each case with the cluster centre it is closest to
• replace each cluster centre by the mean of the feature vectors now associated with it.

and we repeat that at most some number of times or until it stops improving.

• Is this the first, second, or third program you looked at? (Put a circle around one.)
• Write the time:

• Read the k-Means listing until you feel that you have a rough idea of what is going on.
• Write the time:

• Take a break if you like.
• Write the time:

• What is the input to this program?

• The random assignment procedure uses a well known algorithm to choose a random subset of cases. Give a big-oh formula for the cost of this step as a function of the integer inputs to the program.

• Is the whole program certain to terminate in a finite time? If so, give a simple bound in terms of the integer inputs. If not, explain why not.

• Write the time:

• Take a break if you want one.
• Write the time:

• The program contains a deliberate mistake. What is it?

• Write the time: