COSC480/490 Project Topics for 2006

Most of these topics are listed as (480/490). That means that they could be done by a COSC 480 student or a COSC 490 student. Since COSC 480 is supposed to be half as much work as COSC 490, the scope of each project will be adjusted to suit. Negotiating the project scope to suit your time, skills, and interests is an important part of doing a 400-level project, and should be done early.

If you have a topic of your own that you would like to do, by all means ask me or any lecturer in the Department whether we would be willing to supervise it. Take a look at my full topic lists from 2003 and 2004 to get an idea of what I might be interested in. (In 2005 the departmental policy changed so that we had to offer short lists only.)

Supervised by Richard O'Keefe alone

Data Quality (480/490)

It has been estimated that 20 of the fields in real business databases are wrong. The best time to fix this is at the point of entry.

There is a Dunedin business working on this problem. They're looking at the problem of automatically fixing up street addresses as they are entered, mainly using data provided by the Post Office. They have suggested it to us as a research topic, and it's a good one.

There are many ways to approach this:

algorithmic: adapt ternary search trees to do nearest neighbour searches by Levenstein distance instead of Hamming distance
algorithmic: would splay ternary search trees be better than simple ternary search trees?
algorithmic: would N-grams do a better job than ternary search trees?
algorithmic: given a street address to latitude-and-longitude mapping, how can you use that to efficiently compute adjacency and (for example) resolve the common error of the suburb name being the (correctly) spelled name of an adjacent suburb?
algorithmic: given a street address to latitude-and-longitude mapping which is incomplete, can you use other postal address information to compute approximate geographical location?
architectural: how can you use context to focus spelling correction?
application: how can you use ideas shared with address clean-up to help with names?
application: how can you use ideas shared with address clean-up to help with free text from a particular domain (general medical practice)?
user interface: what's a good way to build address clean-up (or name clean-up) into a user interface?
natural language processing: can you find New Zealand postal addresses in Web pages, e-mail, or other electronic data sources?

Information Retrieval on the Cell (480/490)

IBM, Sony, and Toshiba have announced a new processor architecture, the Cell. For a description, see this link. There's also an on-line tutorial from IBM, and I have some manuals in electronic form. This is to be the engine for the PlayStation 3. Each Cell contains a processing unit which is basically a POWER5 (roughly, the next generation after the G5 used in current Macs) plus 8 64-bit vector processing units. 256GFLOPS peak performance at 4.6GHz.

The purpose of this project is to investigate ways in which an Information Retrieval engine could exploit the vector processing units. This is an architecture and design topic, not a detailed implementation one.

Machine Learning--Decision Trees (480/490)

Decision trees are trees where the internal nodes test a variable and the leaves assign cases to classes. They have been studied for several decades. However, what counts as a "good" tree depends on what you want it for.

statisticians, following the famous "Classification And Regression Trees" book, want a tree that makes good predictions, whatever it looks like;
Nathan Rountree wants trees that provide a good architecture for neural nets to grow on;
machine learning people want trees that work well as new "knowledge"; simplicity and economy are good to have.

Decision tree building these days involves first growing a tree to fit the training data precisely and then pruning it back to avoid over-fitting. This project is about trying to improve both growing and pruning, to make the trees work better as "knowledge". Here are two ideas:

A decision procedure that uses fewer variables is better for people. When choosing which variable to test next, one approach would be to add some kind of penalty to variables which have not been tested yet on the current path. Another approach would try to minimise the number of variables used anywhere in the tree.
Suppose you have 100 cases in class 1 and 50 cases in class 2. One tree might just be a leaf, assigning all cases to class 1. That would make 50 mistakes. Another tree might split on some variable and have two leaves, one containing 50 cases all in class 1 and the other containing a 50/50 mix. Whichever class is assigned to the mixed leaf, that tree will make 50 mistakes too. From a statistical point of view, both trees are equally good because both make the name number of correct predictions. From a knowledge point of view, the second tree might be better because it has carved off part of the world where it never makes mistakes.

A 480 project would explore these specific issues on a collection of problems. A 490 project would seek additional ways in which "knowledge" is different from prediction formulas and try to improve them as well.

Evolutionary Programming for Information Retrieval (490)

In the usual approach to Information Retrieval, each possibly relevant document is assigned a weight based on the number of times each term in the query occurs in that document.

When you have a structured document, it makes sense that term occurrences in some parts should be more informative than term occurrences in other parts, and this turns out to be true. But how do we find suitable part-dependent weights? People turn out to be very bad at doing this.

A number of people, including our own Andrew Trotman, have explored this using Genetic Algorithms. Genetic Algorithms represent a candidate solution as a string of symbols, and "breed" a solution using fitness-dependent mating, cross-over (where bits of one solution are swapped with bits of another), and just a touch of mutation. This is supposed to be like evolution in freely mixing populations of two sexes.

Genetic Algorithms do work for this problem, but there might be a better way. Evolutionary Programming represents a candidate solution as a vector of real numbers, and "evolves" a solution using selection and mutation, without any crossover.

The project is to try to reproduce Andrew Trotman's results, but using Evolutionary Programming instead of Genetic Algorithms. Since there are several minor variations on the EP idea, this will involve running lots of experiments, which is why this is rated a 490 project. If EP can do as well as GA or better, the result will be publishable in the EP literature. If it can't, the result will be publishable in the GA literature.

The ideas in this project are not particularly complex, but the programming has to succeed or it will not be possible to do the experiments.

Shared with Andrew Trotman

There's a set of information retrieval topics which Andrew and I will co-supervise. Andrew has worked on Information Retrieval in the Real World.

Retro-Computing (480/490)

In the early 1980s the Wellington Polytechnic and the Department of Education designed and build a computer for high schools. This computer, the Poly, and its successor the Poly 2, were manufactured in New Zealand and sold not only here, but also overseas. The CPU was a Motorola 6809. It had built-in networking and used a remote file server. The operating system was FLEX. This is pretty much all that is currently known about the machine. The primary aim is to track down and document as much as possible about this computer. After this, the project will proceed depending on circumstance.

Andrew has found someone who is willing to sell us a real Poly.

If the purchase goes through and the machine works, it is to be archived (disc images and ROMS moved to the PC).
If the purchase goes through and the machine does not work, If a broken example is found then build a software emulator of as much as is known.
If the purchase does not go through, interviews with key engineers, systems administrators, and users will help document the design and implementation of the Poly.

Search engine ranking for long documents (480/490)

Traditional search engines search and rank whole documents--not a good thing to do if those documents are very long. There already are two technologies for searching parts of documents: element retrieval, and passage retrieval. Element retrieval is used to search XML documents and to identify relevant XML elements. Passage retrieval is used when there is no markup--these algorithms identify relevant passages of text. But how do the these approaches compare head-on? This project involves writing a search engine and experimenting with element and passage ranking algorithms. First a survey of current algorithms for element and passage retrieval will be conducted. A search engine will be written. Then the two approaches will be compared. A standard IR test collection (the INEX collection) will be used and quantitative analysis used for comparison.

Learning to stem using genetic programming (480/490)

Some search engines automatically search for variants of your search terms. If you search for "computer" the search engine will also find pages containing the words "computers", "computing", "computed", and so on. The process of converting all these different morphologically derived variants into one common variant is called stemming. Genetic Programming (GP) is a machine learning technique in which the computer learns a program to solve a given problem. In this investigation GP will be used to learn word morphology rules within a search engine. These learned rules will be compared to previously published human-written rules. An analysis of the results will help further our understanding of word morphology.

Recommending geocaches (480/490)

Geocaching is an orienteering-like game played with a GPS receiver. In this high-tech game players search for a hidden box of goodies (called a geocache) given only a GPS location stated in longitude and latitude (and a GPS receiver). There are nearly 230,000 geocaches hidden in 220 countries, and over 100 around Dunedin--so which is your first find? Existing geocaching web sites recommend caches based purely on distance from your current location, but significantly better approaches exist. In prior research collaborative filtering was shown to be a successful way to recommend to players in their home towns, but what about recommenders for players who play on holiday and in different cities? Part of this project includes placing and finding geocaches in Dunedin. The aim is to build a recommender.