Richard O'Keefe's COSC480/490 full list for 2004

This is the full list of COSC480/COSC490 topics from which the short list was extracted.

There are seven groups of topics:

XML;
natural language processing;
information retrieval;
data structures and algorithms;
compilers;
software engineering;
others.

Topics about XML

Title: XML Query Languages

Level: COSC480 or COSC490

XML is a notation for ordered trees, where a node may have a set of attribute=value annotations as well as a label and children.

XML is used for documents (as in XHTML), for communicating between programs (as in SOAP), and for representing "semi-structured data".

There are several query languages for XML, notably XPath (which is part of several Web standards including XSL), XML-QL, LORE, XQL, and one of my devising called XTrack. There are others which might be worth investigating too.

These query languages tend to have bulky and informal definitions. The aim of the project is to improve our understanding of several XML query languages, their capabilities, and differences.

As a COSC480 project

You would read the relevant literature and summarise it in terms of things we'd like a query language to do or be, how well each of your chosen query languages meets these criteria, and a set of sample queries showing how each can be translated into the several languages (especially queries that cannot be translated into some language).

As a COSC490 project in formal methods

You would try to develop a formal semantics for one query language and a set of sample queries, using the formal semantics and the sample queries to debug each other. (Your formal semantics should use an executable specification language.)

As a COSC490 project in implementation

You would pick one of the languages and implement it in the programming language of your choice, noting in your report any unclarities in the specification and how you resolved them.

Title: XML Document Type Definition induction

Level: COSC480 or COSC490

An XML document is basically a tree, and a DTD is a grammar for these trees. The sequence of children an element may have is constrained by a regular expression. It is useful to construct DTDs by example, especially for compression.

Several methods have been described in the literature, and I have an unpublished one. A critical review of several of these methods would make a good COSC480 project; experimental comparisons would complete a COSC490 project.

Title: Understanding XPath

Level: COSC490

XPath is a query language for XML. It is a notation for selecting a set of nodes in a structured document. Unfortunately, the specification is not as clear as it could be, and it seems to be missing a number of important features.

The goal here is to develop a clear formal specification of XPath, test it with publicly available test cases, and find some transformation property that might be useful for query optimisation and either show that it is valid or show that it is not.

The emphasis is on a clear theoretical understanding of XPath.

Title: Implement or analyse my XML Query Language

Level: COSC480 or COSC490

I have a design for a clean simple XML Query Language called XTrack. It is based on regular expressions. I'd like an implementation of it in any programming language and a library of test cases similar to the ones available for XPath. An efficient implementation, and benchmarking against libxslt would be nice, making a good COSC490 project.

Alternatively, an implementation in a high level language such as Prolog, Smalltalk, or Haskell and a good analysis of the relative strengths and weaknesses of XTrack and XPath (or some other XML query language) would make a COSC480 project.

Title: A Simple XML query language

Level: COSC490

At the moment, the easiest way to extract information from an XML document is to write a program which processes the document according to the tags it sees. Several attempts have been made at ad hoc XML query languages (XML-QL, LORE, XQL and XSL); these languages are big and hard to "get a grasp of". In this project you will design a small language for querying and transforming XML documents and provide a simple implementation for it, and compare it with the others.

Designing your own query language makes this a COSC 490 project.

Title: Explaining Cascading Style Sheets

Level: COSC490

Cascading Style Sheets are the notation that HTML 4 and XHTML use for controlling the appearance of web pages. Basically, a CSS rule specifies a context and a bundle of attribute-value pairs.

The CSS1 and CSS2 specifications from the World-Wide Web Consortium explain CSS in rather fuzzy English; the book by the inventors (which is in the Science library) is no better.

The task of this project is to develop a formal specification that says precisely which attribute-value pairs are to be associated with each node in the parse tree of an HTML or XML document by any given CSS sheet.

Ideally the specification would be tested by converting it into executable form, say as Haskell or Mercury, but that is not a requirement.

This isn't as easy as it sounds; it's a COSC 490 project.

Title: Implementing Unicode

Level: COSC480 or COSC490

ASCII is a 7-bit character set, with 95 printing characters. ISO Latin 1 is an 8-bit character set, with 191 printing characters. Unicode (which is for all practical purposes identical to ISO 10646) is a 21-bit code in which there are currently about 100,000 printing characters. This is the character set that XML is based on.

For a variety of reasons (mainly humanity's creativity at inventing scripts and boneheadedness at developing standards) Unicode is quite complex to process. Let me rephrase that: stunningly complex. There are several useful, feasible, but challenging sub-projects:

Design a UNICODE_STRING class for Eiffel or Smalltalk
Implement Unicode normalisation
Implement Unicode case conversion and matching
Implement Unicode line breaking
Implement Unicode boundary recognition (character, word, line, ...)
Implement a UNIX-style regular expression matcher that supports Unicode (Mark Hopkins' programs might be a good place to start).
Implement a simple formatter (based on source code that will be provided) that handles bidirectional text (for simplicity, I'd recommend (English,Hebrew), but if you want to handle Arabic and understand the problems, go ahead).
Implement Unicode string comparison.
Figure out a way to do string searches in Unicode without normalising the strings first.
Explore Unicode compression, especially TR-6.
Your topic in this area.

Topics in natural language processing

Title: Finding sentence boundaries

Level: COSC490

A common fault in written and typed text is run-on sentences people just don't know where to end sentences sometimes you think the schools would do something about it. Fixing this would be a useful thing in a word processor.

Documents in ancient languages were sometimes written without explicit sentence boundaries. (What does an Egyptian hieroglyphic full stop look like? There isn't one.) For example, I have a machine-readable copy of the Old Testament which is fully morphologically analysed, but there are no full stops in it because the ancient copy on which it is based didn't have any.

The goal of the project is to develop a software component which can find likely sentence boundaries, given text where some or all of the full stops are missing. The method it uses should not depend on the human language it is applied to.

One general kind of approach to this is to train some kind of statistical model on a corpus of "full" text and then use it to predict where the boundaries should be in "unmarked" text. Hidden Markov Models have proven useful for finding boundaries in DNA and for a variety of natural language processing tasks such as part-of-speech tagging.

You would need to read some of the literature on Statistical Natural Language Processing, write your own modelling code or find and modify an existing program, obtain or construct suitable training data and test data, train your model, and test it. You might have to try more than one approach.

You may use any programming language which you determine to be suitable. You need not worry about the complexities of Unicode. You may work with English text if you prefer.

Title: Project Gutenberg auto-markup

Level: COSC480 or COSC490

Project Gutenberg is an ambitious project whose aim is to provide as much as possible of the world's literature in free electronic form. Volunteers produce plain-text versions of out-of-copyright works, upload them, and the rest is a free reading frenzy.

Currently they have about 2GB of text on offer. Most of this is English, because that's what they've been given. Some texts are in other languages.

Plain texts such as books are not marked up. Yet they have structure. Your mission, should you choose to accept it, is to devise an SGML or XML "grammar" for books (I will provide lots of help with that), and a method of converting plain text Project Gutenberg books to this format that works with at least 3 books. Python or Perl might be suitable languages for writing the converter in, but that's up to you.

Recovering structure from existing semi-structured or even unstructured documents is a real task for many organisations.

As a COSC480 project, you would study a small number of books (maybe four), devise a block-level structure, write code to convert those books to that structure, and test your code on a small number of different books (maybe two or three).

As a COSC490 project, you would study and test on a larger number of books, devise a richer structure, and do a more thorough literature survey.

Title: Tagging Mäori with Hidden Markov Models

Level: COSC490

Hidden Markov Modelling is a machine learning technique that learns an approximate grammar for a language by fitting a finite state automaton with random transitions.

Parts of speech are things like "mass noun", "count noun", "adjective", "article", "preposition", "transitive verb", "benefactive verb". Part of speech tagging is taking plain text and annotating each word with its part of speech.

We have software that uses Hidden Markov Models tag unrestricted English text with the part of speech of each word, including words it has never seen before. This project involves applying that technique to Mäori, and evaluating how well the technique works. This requires finding or building a training corpus of already tagged Mäori text.

For processing historical texts, we'd like to be able to recover vowel length from text where it isn't indicated; if time permits it would be useful to try to extend your model to handle this.

Title: Morphological Analyser for some language

Level: COSC490

A morphological analyser is a program which can break a word like "unavoidably" into meaningful pieces like UN+AVOID+ABLE+LY. Such components are useful in computational linguistics, and for information retrieval in morphologically rich languages.

Write a morphological analyser for a non-European language. Previous theses (for Indonesian and Persian) will be available to see how it's done. There is a free tool called PC-KIMMO that makes it a lot easier to do this, but you may use any tools you choose.

You need not be a native speaker of the language you choose to study, nor even especially fluent. All the information you require is in books.

A particular challenge for Mäori; is the three different orthographies in use (differing in whether/how they indicate vowel length); can you use context to recover vowel length?

Topics in Information Retrieval

Andrew Trotman, who has worked in the Information Retrieval industry, is available to serve as co-supervisor for this and related topics.

Title: Thesaurus Searching

Level: COSC490

Implement and measure the effects of thesaurus search in information retrieval. Several public domain thesauri are available such as UMLS and Roget's. How does searching in this way affect precision and recall?

You will have to learn about basic information and about thesaurus searching. While you will would extend Andrew Trotman's search engine or some freely available engine rather than writing your own, this project will still require a reasonable amount of programming. We have the standard TREC collection of documents and queries with known relevance judgements.

You will design, perform, and analyse a lot of computer experiments. It is not so much the depth of understanding as the amount of labour which makes this a 490 project; insightful analysis of the results is also important.

Title: Stemming for Information Retrieval in some language

Level: COSC480 or COSC490

Information retrieval systems for English basically work by

throwing out high frequency words like "and", "the", and so on (this is called "stopping");
stripping off affixes (for example, uncommonly = un+common+ly => common) with an emphasis on doing it fast rather than with linguistic perfection;
computing some measure of similarity between the query and each of a set of documents (for a small collection we might scan the lot; large collections are generally pre-indexed; but that is not relevant to this topic);
reporting the documents in descending order of similarity.

This project involves

finding a set of documents that are available in both English and your choice of language (chapters of the Bible would do nicely, or government publications) in machine readable form;
finding or developing a set of queries with known matches and translating them to the other language;
developing a "stop" list for your chosen language (that is, words not to be considered (the obvious candidates are closed class words like pronouns, prepositions, particles, articles and so on);
developing an algorithm for "stemming" in your chosen language. As a rule, inflectional morphology (like "-ed", "-s" in English) should be stripped, derivational morphology (like "-er", "-ly", and "un-" in English, or "(C)anga", "(C)ia" and "whaka-" in Mäori) probably should not be stripped, and compound words might be split or might not. Some of these ideas might make things worse.

The first task for this project will be to find out what has already been tried for your chosen language.

Whenever you use a search engine on the Web, you are doing information retrieval. HTML 4 and XHTML provide a way to tag (parts of documents) with the language they are written in, so that appropriate conventions may be used. Your ideas could end up in a search engine.

We have access to the source code of an IR engine, so the amount of coding required doesn't have to be large.

Title: 4th Year Project Library

Level: COSC490

Design a searchable departmental web site for housing 4th year projects and papers published by department members. What are the Computer/Human issues involved? How can the longevity of the site be guaranteed once the author has moved on?

Title: Benchmarking

Level: COSC480 or COSC490

Download and benchmark freely available IR systems against freely available IR datasets. What ranking functions do they use? What measure of "goodness" do they use? How fast are they at indexing and searching? How do recent ranking algorithms perform on the same datasets?

You can do a small amount of work for a COSC480 project or a lot for a COSC490 project. Beware of underestimating the amount of work required.

Title: Web-Spider

Level: COSC480 or COSC490

Each publisher has its own web site, but the journals we read come from multiple publishers. A web-spider can download abstracts from publishers' web sites, convert them to XML and index them in a central database.

Title: Text data mining

Level: COSC490

The ACM publishes conference proceedings each year. Conference papers often cite recent publications, publications that we as academics "should have read". A computer can download these papers, process them and build such a list. What should we have read that we might have missed?

Title: Term Frequency Ranking

Level: COSC480 or COSC490

Many IR systems use the frequency of a term in a document as an integral part of ranking. Term frequency, however, varies with document length. Log frequency is often used to compensate for length, yet there is no mathematical basis for this choice. Generate graphs of term frequency against document length, determine if there is a correlation, and if so how can it be approximated (exponential, polynomial, genetically learned, or whatever)?

Title: Document Frequency Ranking

Level: COSC480 or COSC490

Many IR systems use document frequency in their ranking functions. As the number of times a term occurs in a corpus increases, that term becomes more general to the corpus and therefore less useful as a distinguishing term. Given a corpus and a set of queries, good values for document frequency can be learned using AI techniques. What is the relationship between the learned weights and the document frequency values?

Topics in Data Structures and Algorithms

Title: Have you got the BLAS?

Level: COSC480 or COSC490.

There is a collection of vector and matrix processing procedures called the Basic Linear Algebra Subprograms (BLAS for short). There are three levels: vector operators, vector/matrix operations, and matrix/matrix operations. There is a model implementation in Fortran, and system vendors like Sun, DEC (now HP/Compaq), IBM, and Intel amongst others offer highly tuned implementations for their systems.

The goal of this project is to design, build, and test an object-oriented interface to some subset of the BLAS. Specifically, I would like to be able to call the BLAS from Squeak Smalltalk or GNU Eiffel. Note that the goal is to produce an interface to an existing Fortran or C library, not to re-implement the BLAS.

As a COSC480 project

You will carry out a design for the level 1 BLAS in both Smalltalk and Eiffel. You will comment insightfully on the effects that differences in the languages have on idiomatic designs. You will construct examples showing the use of your designs. Generating really good designs is harder than you might think, especially as I expect that you will need different designs for the two languages. (What does that say about UML?) Implementation is not required for COSC480.

As a COSC490 project

You will carry out a design for the level 1 and level 2 BLAS in either Smalltalk or Eiffel, implement your design, and test it. A good test of test cases is an important part of this project; I suggest using SUnit for Smalltalk or EiffelUnit for Eiffel.

The level 2 and 3 BLAS are concerned with several different array representations, this is why doing the level 2 BLAS is harder than just the level 1 BLAS.

Title: Sets in a High Level Language

Level: COSC480 or COSC490

Sets are important data structures. We would like to have space-efficient representations for them with time-efficient algorithms for performing the basic set operations union, intersection, relative complement, membership, and enumeration of elements, and we'd like to have them for declarative programming languages.

I have six set representations that I would like implemented and benchmarked, or analysed. Two are already implemented in Prolog and Lisp. There are also some interesting data structures in Okasaki's book. This project involves

Choosing a declarative language to work with. Mercury would be preferred, but I'll accept (pure) Common Lisp, (pure) Scheme, ML, O'CAML, Haskell, Clean, (pure) Prolog, or Erlang.
Specifying an interesting set of benchmarks.
Implementing some or all of the representations and benchmarking them. It is more important to have full benchmarks for a few data structures than scraps for many data structures.
If time permits, doing a theoretical analysis of the space and time costs. I will help with this, it's not as big a deal as it sounds.
Analysing the results and writing them up in a report.

Without the theoretical analysis, and perhaps with a limited range of techniques, this could be a COSC 480 project.

With the theoretical analysis, and especially with some designs of your own, this could be a COSC 490 project.

Title: Statistical Calculations

Level: COSC480

If you've ever done any statistics, you'll know about t-tests, chi square, regression, analysis of variance, and so on. According to [Rand R. Wilcox, Fundamentals of Modern Statistical Methods, Springer-Verlag 2001], those techniques are not only 40 years out of date, they are dangerously vulnerable to the quirks of real-life data sets.

We have source code for two free statistics packages, XLispStat and R, and the free MATLAB-like package Octave. Implement some of the techniques described in that book, or any other robust methods, in clear well-documented and tested code for any of those packages.

Title: Tranz Rail Cook Strait Ferry

Level: COSC480 or COSC490

The Cook Strait ferry currently takes the most direct path from Wellington to Picton and back. It is not optimised for wind, rain, weather, tides, and so on. The project is to find a way for a computer to find an optimal (or at any rate better) path. Criteria might include travel time, fuel consumption, or profit. We have a contact at Tranz Rail who can provide information and is interested in seeing this happen.

There are several techniques that might be tried. The Calculus of Variations is one, Evolutionary Algorithms is another.

For a COSC480 project, you might work with a simplified model and a simplified technique. For a COSC490 project, you might work with a more realistic model and/or compare two or more techniques.

Software Engineering

Title: Spelling Checker for Source Code

Level: COSC480

Modify ispell or any other spelling checker of your choice to check the spelling of comments in source code. You may choose the programming language, although C, C++, Java, Smalltalk, Prolog, or Erlang would be preferred. Better still, try to come up with a design that makes it easy to plug in support for other programming languages, though you still need only implement checking for one.

The very best thing would be to recognise identifiers defined in the language standard and declared in the source code as "words" and not report them as errors, but that requires a two-pass algorithm. (Why two passes? Because comments often describe the next declaration.) However, even just checking the commentary as if the source code proper were absent would be very useful.

If the spelling checker you modify is not interactive (and ispell is), take care to produce output in error(1) format so that UNIX tools such as emacs can be used.

Try running your modified spelling checker over an existing body of code. In your report, comment on the practical utility of this tool.

Note: it took me about an hour of reading the ispell sources to find the one small place that needed to be changed. However, ispell is a large program, so you may find it harder than I did. You don't have to use ispell, but you do have to report somehow where the errors are, so spell(1) won't quite do.

This is a software maintenance project and might suit COSC 480. A program for extracting comments in quite a few programming languages is available, so the amount of new code might not be very high.

Title: Technical Writing

Level: COSC480

The best software tools in the world are no use unless people who would benefit from them are told clearly how to use the tools to solve their problems. A good technical writer is one of the most valuable people in a programming team (if you can't explain it, it's probably wrong).

There is a static checker for C called SPlint. If you just give it your C source files, it can find most of the things that lint(1) can. If you give it additional specifications, it can check a lot more for you. However, the existing documentation takes it for granted that you already understand a lot of this, or have the patience to read a couple of books.

A user manual, which started with what SPlint can do for you with no extra specifications, and introduced each additional thing you can do, with examples, would make this tool much more accessible. I envisage a document somewhere between 60 and 120 pages. Perhaps half the bulk of the text would be examples. The document should be written using LaTeX, and should have an excellent index and table of contents. I will be able to answer most questions about the program, and we have a good e-mail relationship with the maintainer of the program.

Compilers

Title: C Compiler Maintenance

Level: COSC490

There is a free C compiler called lcc. The source code has been published and explained in a book, which I can lend you. It handles all of C89, but since then, several extensions have been added, and there is now a C99 standard.

Pick a small number of new features, such as

digraphs (alternatives for some of C's symbols, useful for people using a 7-bit non-ASCII character set),
Universal Character Names (the ability to write identifiers using Unicode, and to encode Unicode characters as \uxxxx even in an identifier)
declarations as statements (copied from C++)
flexible arrays in records (replaces the "struct hack")
variable size arrays (brings C up to date with the 1960's)

implement some of them (you won't have time to do as many as you think), and test them. Write up what you have done and what you have learned. Send the lot to the lcc repository. This is a COSC 490 project.

Title: AWK for SGML

Level: COSC480

AWK is a UNIX programming language for manipulating files that can be viewed as sequences of lines. I've done some SGML manipulation in AWK, and from that have designed an AWK-like language with direct support for SGML. I also have an unfinished compiler for this language, in Prolog, about 3,600 SLOC.

There are at least two possible projects here:

Help me finish the compiler.
Study test generation and use the grammar to automatically derive lots of test cases.

Richard O'Keefe's COSC480/490 full list for 2004

Topics about XML

Title: XML Query Languages

Level: COSC480 or COSC490

As a COSC480 project

As a COSC490 project in formal methods

As a COSC490 project in implementation

Title: XML Document Type Definition induction

Level: COSC480 or COSC490

Title: Understanding XPath

Level: COSC490

Title: Implement or analyse my XML Query Language

Level: COSC480 or COSC490

Title: A Simple XML query language

Level: COSC490

Title: Explaining Cascading Style Sheets

Level: COSC490

Title: Implementing Unicode

Level: COSC480 or COSC490

Topics in natural language processing

Title: Finding sentence boundaries

Level: COSC490

Title: Project Gutenberg auto-markup

Level: COSC480 or COSC490

Title: Tagging Mäori with Hidden Markov Models

Level: COSC490

Title: Morphological Analyser for some language

Level: COSC490

Topics in Information Retrieval

Title: Thesaurus Searching

Level: COSC490

Title: Stemming for Information Retrieval in some language

Level: COSC480 or COSC490

Title: 4th Year Project Library

Level: COSC490

Title: Benchmarking

Level: COSC480 or COSC490

Title: Web-Spider

Level: COSC480 or COSC490

Title: Text data mining

Level: COSC490

Title: Term Frequency Ranking

Level: COSC480 or COSC490

Title: Document Frequency Ranking

Level: COSC480 or COSC490

Topics in Data Structures and Algorithms

Title: Have you got the BLAS?

Level: COSC480 or COSC490.

As a COSC480 project

As a COSC490 project

Title: Sets in a High Level Language

Level: COSC480 or COSC490

Title: Statistical Calculations

Level: COSC480

Title: Tranz Rail Cook Strait Ferry

Level: COSC480 or COSC490

Software Engineering

Title: Spelling Checker for Source Code

Level: COSC480

Title: Technical Writing

Level: COSC480

Compilers

Title: C Compiler Maintenance

Level: COSC490

Title: AWK for SGML

Level: COSC480

Other topics

Title: MIDI Classification

Level: COSC480

Title: Data Mining

Level: COSC480

Title: Your topic

Level: COSC480 or COSC490