Richard O'Keefe's COSC480/490 short list for 2003

This is a sample of topics from a long list. See the full list.

Topic in XML

Title: XML Query Languages

Level: COSC480 or COSC490

XML is a notation for ordered trees, where a node may have a set of attribute=value annotations as well as a label and children.

XML is used for documents (as in XHTML), for communicating between programs (as in SOAP), and for representing "semi-structured data".

There are several query languages for XML, notably XPath (which is part of several Web standards including XSL), XML-QL, LORE, XQL, and one of my devising called XTrack. There are others which might be worth investigating too.

These query languages tend to have bulky and informal definitions. The aim of the project is to improve our understanding of several XML query languages, their capabilities, and differences.

As a COSC480 project

You would read the relevant literature and summarise it in terms of things we'd like a query language to do or be, how well each of your chosen query languages meets these criteria, and a set of sample queries showing how each can be translated into the several languages (especially queries that cannot be translated into some language).

As a COSC490 project in formal methods

You would try to develop a formal semantics for one query language and a set of sample queries, using the formal semantics and the sample queries to debug each other. (Your formal semantics should use an executable specification language.)

As a COSC490 project in implementation

You would pick one of the languages and implement it in the programming language of your choice, noting in your report any unclarities in the specification and how you resolved them.

Topic in natural language processing

Title: Finding sentence boundaries

Level: COSC490

A common fault in written and typed text is run-on sentences people just don't know where to end sentences sometimes you think the schools would do something about it. Fixing this would be a useful thing in a word processor.

Documents in ancient languages were sometimes written without explicit sentence boundaries. (What does an Egyptian hieroglyphic full stop look like? There isn't one.) For example, I have a machine-readable copy of the Old Testament which is fully morphologically analysed, but there are no full stops in it because the ancient copy on which it is based didn't have any.

The goal of the project is to develop a software component which can find likely sentence boundaries, given text where some or all of the full stops are missing. The method it uses should not depend on the human language it is applied to.

One general kind of approach to this is to train some kind of statistical model on a corpus of "full" text and then use it to predict where the boundaries should be in "unmarked" text. Hidden Markov Models have proven useful for finding boundaries in DNA and for a variety of natural language processing tasks such as part-of-speech tagging.

You would need to read some of the literature on Statistical Natural Language Processing, write your own modelling code or find and modify an existing program, obtain or construct suitable training data and test data, train your model, and test it. You might have to try more than one approach.

You may use any programming language which you determine to be suitable. You need not worry about the complexities of Unicode. You may work with English text if you prefer.

Topic in Information Retrieval

Andrew Trotman, who has worked in the Information Retrieval industry, is available to serve as co-supervisor for this and related topics.

Title: Thesaurus Searching

Level: COSC490

Implement and measure the effects of thesaurus search in information retrieval. Several public domain thesauri are available such as UMLS and Roget's. How does searching in this way affect precision and recall?

You will have to learn about basic information and about thesaurus searching. While you will would extend Andrew Trotman's search engine or some freely available engine rather than writing your own, this project will still require a reasonable amount of programming. We have the standard TREC collection of documents and queries with known relevance judgements.

You will design, perform, and analyse a lot of computer experiments. It is not so much the depth of understanding as the amount of labour which makes this a 490 project; insightful analysis of the results is also important.

Topic in Data Structures and Algorithms

Title: Have you got the BLAS?

Level: COSC480 or COSC490.

There is a collection of vector and matrix processing procedures called the Basic Linear Algebra Subprograms (BLAS for short). There are three levels: vector operators, vector/matrix operations, and matrix/matrix operations. There is a model implementation in Fortran, and system vendors like Sun, DEC (now HP/Compaq), IBM, and Intel amongst others offer highly tuned implementations for their systems.

The goal of this project is to design, build, and test an object-oriented interface to some subset of the BLAS. Specifically, I would like to be able to call the BLAS from Squeak Smalltalk or GNU Eiffel. Note that the goal is to produce an interface to an existing Fortran or C library, not to re-implement the BLAS.

As a COSC480 project

You will carry out a design for the level 1 BLAS in both Smalltalk and Eiffel. You will comment insightfully on the effects that differences in the languages have on idiomatic designs. You will construct examples showing the use of your designs. Generating really good designs is harder than you might think, especially as I expect that you will need different designs for the two languages. (What does that say about UML?) Implementation is not required for COSC480.

As a COSC490 project

You will carry out a design for the level 1 and level 2 BLAS in either Smalltalk or Eiffel, implement your design, and test it. A good test of test cases is an important part of this project; I suggest using SUnit for Smalltalk or EiffelUnit for Eiffel.

The level 2 and 3 BLAS are concerned with several different array representations, this is why doing the level 2 BLAS is harder than just the level 1 BLAS.