Richard O'Keefe's COSC480/490 short list for 2004

This is a sample of topics from a long list. See the full list.

Topic in XML

Title: XML Query Languages

Level: COSC480 or COSC490

XML is a notation for ordered trees, where a node may have a set of attribute=value annotations as well as a label and children.
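
For example, the following invented fragment denotes a tree whose root has the label book, one attribute, and two ordered children:

    <book year="2004">
      <title>An Example</title>
      <author>A. Writer</author>
    </book>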

XML is used for documents (as in XHTML), for communicating between programs (as in SOAP), and for representing "semi-structured data".

There are several query languages for XML, notably XPath (which is part of several Web standards, including XSL), XML-QL, Lorel (the query language of the LORE system), XQL, and one of my devising called XTrack. There are others which might be worth investigating too.
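
To give a taste of what such queries look like, here is a small sketch using Python's standard xml.etree.ElementTree library, whose findall method implements a restricted subset of XPath; the document and the queries are invented for illustration:

    import xml.etree.ElementTree as ET

    # An invented miniature document: a library of books.
    doc = ET.fromstring("""
    <library>
      <book year="1996"><title>Tree Automata</title></book>
      <book year="2001"><title>Semistructured Data</title></book>
    </library>
    """)

    # Path query: every title below a book, anywhere in the document.
    for title in doc.findall(".//book/title"):
        print(title.text)

    # Attribute predicate: books published in 2001.
    for book in doc.findall(".//book[@year='2001']"):
        print(book.find("title").text)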

These query languages tend to have bulky and informal definitions. The aim of the project is to improve our understanding of several XML query languages: what they can do, and how they differ.

As a COSC480 project

You would read the relevant literature and summarise it in terms of: the things we would like a query language to do or be; how well each of your chosen query languages meets those criteria; and a set of sample queries showing how each query can be expressed in the several languages (especially queries that cannot be expressed in some of them).

As a COSC490 project in formal methods

You would try to develop a formal semantics for one query language and a set of sample queries, using the formal semantics and the sample queries to debug each other. (Your formal semantics should use an executable specification language.)
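
To give the flavour of what "executable" means here, the following sketch (in Python rather than in a specification language proper, and for an invented two-operator path fragment only) gives the meaning of a step, and of a whole path, as a function from a node to the list of nodes it selects:

    # A node is a pair (label, children); attributes are omitted for brevity.

    def descendants(node):
        """Every node strictly below `node`, in document order."""
        result = []
        for child in node[1]:
            result.append(child)
            result.extend(descendants(child))
        return result

    # The meaning of a step is a function from one node to the nodes it selects.
    def child(label):
        return lambda node: [c for c in node[1] if c[0] == label]

    def descendant(label):
        return lambda node: [d for d in descendants(node) if d[0] == label]

    # The meaning of a path is the composition of the meanings of its steps.
    def path(*steps):
        def run(node):
            selected = [node]
            for step in steps:
                selected = [m for n in selected for m in step(n)]
            return selected
        return run

    # Sample query "descendant::book/child::title" against an invented tree.
    tree = ("library", [("book", [("title", [])]),
                        ("paper", [("title", [])])])
    print(path(descendant("book"), child("title"))(tree))   # [('title', [])]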

As a COSC490 project in implementation

You would pick one of the languages and implement it in the programming language of your choice, noting in your report any ambiguities in the specification and how you resolved them.

Topic in natural language processing

Title: Finding sentence boundaries

Level: COSC490

A common fault in written and typed text is run-on sentences people just don't know where to end sentences sometimes you think the schools would do something about it. A word processor that could detect and repair such faults would be useful.

Documents in ancient languages were sometimes written without explicit sentence boundaries. (What does an Egyptian hieroglyphic full stop look like? There isn't one.) For example, I have a machine-readable copy of the Old Testament which is fully morphologically analysed, but there are no full stops in it because the ancient copy on which it is based didn't have any.

The goal of the project is to develop a software component which can find likely sentence boundaries, given text where some or all of the full stops are missing. The method it uses should not depend on the human language it is applied to.

One general kind of approach to this is to train some kind of statistical model on a corpus of "full" text and then use it to predict where the boundaries should be in "unmarked" text. Hidden Markov Models have proven useful for finding boundaries in DNA and for a variety of natural language processing tasks such as part-of-speech tagging.
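
As a rough illustration of the kind of model meant here (not a prescription), the following sketch trains a two-state HMM by counting on text with known boundaries, the states being "this token ends a sentence" and "it does not", and decodes unmarked text with the Viterbi algorithm. The toy corpus and the add-alpha smoothing are invented:

    from collections import defaultdict
    from math import log

    STATES = ("B", "I")   # B: a sentence boundary follows this token; I: it does not

    def train(sentences, alpha=1.0):
        """Estimate transition and emission log-probabilities by counting,
        with add-alpha smoothing.  `sentences` is a list of token lists."""
        trans = defaultdict(float)
        emit = defaultdict(float)
        occur = defaultdict(float)
        vocab = set()
        prev = "B"                      # the text notionally starts after a boundary
        for sentence in sentences:
            for i, tok in enumerate(sentence):
                state = "B" if i == len(sentence) - 1 else "I"
                trans[(prev, state)] += 1
                emit[(state, tok)] += 1
                occur[state] += 1
                vocab.add(tok)
                prev = state
        v = len(vocab) + 1              # one extra slot for unseen tokens
        def t(p, s):                    # log P(state s | previous state p)
            return log((trans[(p, s)] + alpha) / (occur[p] + 2 * alpha))
        def e(s, tok):                  # log P(token | state s)
            return log((emit[(s, tok)] + alpha) / (occur[s] + v * alpha))
        return t, e

    def boundaries(tokens, t, e):
        """Viterbi decoding: indices of tokens predicted to end a sentence."""
        score = {s: t("B", s) + e(s, tokens[0]) for s in STATES}
        back = []
        for tok in tokens[1:]:
            new, ptr = {}, {}
            for s in STATES:
                p = max(STATES, key=lambda q: score[q] + t(q, s))
                new[s] = score[p] + t(p, s) + e(s, tok)
                ptr[s] = p
            score = new
            back.append(ptr)
        s = max(STATES, key=score.get)
        path = [s]
        for ptr in reversed(back):
            s = ptr[s]
            path.append(s)
        path.reverse()
        return [i for i, s in enumerate(path) if s == "B"]

    # Invented toy data: train on boundary-marked text, decode unmarked text.
    marked = [["the", "cat", "sat"], ["it", "purred"], ["we", "left"]]
    t, e = train(marked)
    print(boundaries(["the", "cat", "sat", "it", "purred"], t, e))
    # -> [2, 4]: boundaries after "sat" and after "purred"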

You would need to read some of the literature on Statistical Natural Language Processing, write your own modelling code or find and modify an existing program, obtain or construct suitable training data and test data, train your model, and test it. You might have to try more than one approach.

You may use any programming language which you determine to be suitable. You need not worry about the complexities of Unicode. You may work with English text if you prefer.

Topic in Information Retrieval

Andrew Trotman, who has worked in the Information Retrieval industry, is available to serve as co-supervisor for this and related topics.

Title: Thesaurus Searching

Level: COSC490

Implement and measure the effects of thesaurus search in information retrieval. Several thesauri are freely available, such as UMLS and Roget's. How does searching in this way affect precision and recall?
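
The core idea can be sketched in a few lines; the thesaurus, the document collection, and the relevance judgements here are all invented stand-ins for the real resources (a genuine experiment would use a proper engine and the TREC data mentioned below):

    # Invented toy data; a real experiment would use TREC documents and judgements.
    thesaurus = {"car": {"automobile", "vehicle"}, "fast": {"quick", "rapid"}}

    docs = {
        1: {"the", "automobile", "was", "quick"},
        2: {"a", "fast", "car"},
        3: {"the", "slow", "train"},
    }
    relevant = {1, 2}          # known relevance judgements for this query

    def expand(query):
        """Add every thesaurus synonym of every query term."""
        expanded = set(query)
        for term in query:
            expanded |= thesaurus.get(term, set())
        return expanded

    def retrieve(terms):
        """Boolean OR retrieval: any document containing any of the terms."""
        return {d for d, words in docs.items() if words & terms}

    def precision_recall(retrieved):
        hits = retrieved & relevant
        return len(hits) / len(retrieved), len(hits) / len(relevant)

    query = {"fast", "car"}
    print(precision_recall(retrieve(query)))          # without expansion
    print(precision_recall(retrieve(expand(query))))  # with expansion

On this toy data the expansion raises recall without hurting precision; on real collections expansion often trades precision away, which is exactly what you would be measuring.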

You will have to learn about basic information retrieval and about thesaurus searching. While you would extend Andrew Trotman's search engine or some freely available engine rather than writing your own, this project will still require a reasonable amount of programming. We have the standard TREC collection of documents and queries with known relevance judgements.

You will design, perform, and analyse a lot of computer experiments. It is not so much the depth of understanding as the amount of labour which makes this a 490 project; insightful analysis of the results is also important.


Title: MIDI Classification

Level: COSC480

This is a continuation of a COSC480 project done last year. The Musical Instrument Digital Interface (MIDI) is a standard format for representing music as a sequence of events. It is a little less abstract than common musical notation, but a lot more "symbolic" than digital recordings.

The goal is to find a way of automatically classifying music held in the form of MIDI files. This involves gathering MIDI files (we have some already, but do we have enough?), extracting information from them (a free program called KeyKit can be used for that), computing features from that information (but which features?), and using clustering or some other data mining technique to form groups (but which technique?).
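
Assuming the event data has already been extracted from the MIDI files (with KeyKit or otherwise), the clustering step might look roughly like this; the feature vectors and file names are invented:

    from random import sample, seed

    # Invented feature vectors, one per MIDI file:
    # (mean tempo in BPM / 100, notes per second, mean pitch / 127).
    features = {
        "gigue.mid":    (1.40, 6.0, 0.55),
        "nocturne.mid": (0.60, 1.5, 0.50),
        "march.mid":    (1.20, 4.0, 0.45),
        "lullaby.mid":  (0.55, 1.0, 0.60),
    }

    def dist(a, b):
        """Squared Euclidean distance between two feature vectors."""
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def kmeans(points, k, rounds=20):
        """Plain k-means: assign each point to its nearest centre, re-average."""
        seed(1)                        # fixed seed so the run is repeatable
        centres = sample(points, k)
        for _ in range(rounds):
            groups = [[] for _ in range(k)]
            for p in points:
                groups[min(range(k), key=lambda i: dist(p, centres[i]))].append(p)
            centres = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
                       for g, c in zip(groups, centres)]
        return centres

    names, points = zip(*features.items())
    centres = kmeans(list(points), k=2)
    for name, p in zip(names, points):
        print(name, "-> cluster", min(range(2), key=lambda i: dist(p, centres[i])))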

There are many ways to classify music. Some (mostly slow/mostly fast; mostly loud/mostly soft; solo/ensemble) are fairly trivial. Others such as period, composer, or genre might be more interesting. Or could you come up with a different kind of classification entirely?

This year it was reported that people, asked to classify music from 10-second samples, were unable to agree with each other...