COSC 463 Assignment 1 — 2010

The aims of this assignment

The aims of this assignment include giving you practical experience with tokenising real text, counting words and word frequencies with standard Unix tools, checking how well real documents obey Zipf's law and Heaps' law, and using a part-of-speech tagger.

Resources

For this assignment, I have provided three books marked up in XML. I got the books as plain text from Project Gutenberg, and added the XML markup myself. (Some of the markup was automated, but by no means all of it.) These books were chosen to be different in topic and style, not because I endorse the contents of any of them.

I'll ask you to count all words, all content words, and all function words. I've provided a little AWK program, filter.awk, which in its present form selects the function words from a token stream on standard input and copies them to standard output. You can easily change it: add a "!" in the right place to get a program that selects content words instead, or remove a test to get one that selects all alphabetic words, whatever they are.
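
A program of this kind looks roughly like the following (a sketch with a token function-word list, not the actual contents of filter.awk, which has a much fuller list and may be organised differently):

    # sketch: copy function words from a one-token-per-line stream to standard output
    BEGIN {
        n = split("the a an and or but of to in on at for with is was be", fw, " ")
        for (i = 1; i <= n; i++) isfunction[fw[i]] = 1
    }
    /^[a-zA-Z]/ && (tolower($0) in isfunction) { print }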

I'm also providing a part-of-speech tagger called QTag. It's a Java program, and the author does not provide sources for it. There is a newer version than the one I provide, but this version seems to be simpler to use.

You are asked to plot some graphs. You may use any software you like for this, but I recommend R, which is already on the lab machines.

What you must do

  1. Decide what is to count as a word. Tell me in plain text, and again by writing a regular expression. In particular, you must decide what to do about hyphens, what to do about apostrophes, what to do about abbreviation points, what to do about alphabetic case, what to do about numbers, and if you decide to treat numbers as words, what to do about decimal points, thousands separators, and fractions like ½ and 22/7.
  2. Convert each of the books to word sequences, one word or one punctuation mark per line. Note that some punctuation marks (like dashes) may be written as more than one character (two hyphens, for example).
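
    For example, a very rough tokeniser along these lines (a sketch only: it strips the XML tags with sed, treats a run of letters with internal apostrophes or hyphens as a word, and emits each punctuation mark as its own token; your own rules from step 1 will differ in detail):

    sed -e 's/<[^>]*>//g' <book.xml |
    grep -oE "[A-Za-z]+(['-][A-Za-z]+)*|--|[[:punct:]]" >tokens
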
  3. Count Words

    Given a file with one word per line, we can select those lines that are not numbers or punctuation marks by doing

    grep '^[a-zA-Z]' <tokens >words
    

    Alternatively, as noted above, you can use a copy of filter.awk with the is-a-function-word test deleted:

    awk -f filter.awk <tokens >words
    

    To find out how often things occur, we can bring equal ones together using sort(1) and then count them using uniq(1) with the -c option. To find out the commonest ones, we can then sort the output of uniq(1) into descending order by number, and take the top N. So to find the top 30 words,

    grep '^[a-zA-Z]' <tokens | sort | uniq -c |
    sort -nr | head -30
    

    I want you to count all words, to count function words, and to count content words (defined as any words other than function words). Make three copies of filter.awk: filter-all.awk, filter-function.awk, and filter-content.awk. Make the appropriate changes to filter-content.awk and filter-all.awk.
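
    If the selection rule in filter.awk looks roughly like the sketch given under Resources, the three copies need differ only in that one rule (again a sketch; adapt it to the real file):

    filter-function.awk:  /^[a-zA-Z]/ && (tolower($0) in isfunction)   { print }
    filter-content.awk:   /^[a-zA-Z]/ && !(tolower($0) in isfunction)  { print }
    filter-all.awk:       /^[a-zA-Z]/                                  { print }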

    I want you to

    1. For each book, find the 30 most frequent words. Produce a single page showing
      Wallace            Doyle              Wittgenstein
      %1  $1             %1'  $1'           %1"  $1"
      ...
      %30 $30            %30' $30'          %30" $30"
      
      where the %i entries are the percentage of all word occurrences represented by the ith word, and the $i entries are the ith word, for each document.

      Comment on what you find.
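
      One way to get the percentage-and-word pairs for one book (a sketch; it assumes words holds one word per line for that book, as produced in step 3):

      total=$(wc -l <words)
      sort <words | uniq -c | sort -nr | head -30 |
      awk -v total="$total" '{ printf "%.2f  %s\n", 100*$1/total, $2 }'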

    2. For each book, find, display, and comment on, the 30 most common content words.
    3. For each book, find, display, and comment on, the 30 most common function words.
    4. For each book, find the full list of all words with their counts (grep, sort, uniq -c, sort) and throw away the words, just keeping the counts. You can use
      awk '{print $1}'
      
      to copy the counts to standard output. According to Zipf's law, the frequency of the kth word in the list should be proportional to 1/k.

      For each of the three books, plot a graph with log10(number of occurrences) as the vertical axis and log10(position in list) as the horizontal axis. Do the graphs have the same shape? What are their shapes? How well do these documents obey Zipf's law?
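
      One way to prepare the data (a sketch; the output file name is assumed) is to number the sorted counts and let R take the logs:

      grep '^[a-zA-Z]' <tokens | sort | uniq -c | sort -nr |
      awk '{ print NR, $1 }' >zipf.dat

      Each line of zipf.dat is then "rank count" for one word; read it into R and plot log10 of one column against log10 of the other.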

    5. Look up "Heaps' law". What does it predict about these documents? Is that prediction true?
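
      Checking it needs the running number of distinct words (types) against the number of words read so far (tokens), for example (a sketch, assuming words holds one word per line in document order):

      awk '!($0 in seen) { seen[$0] = 1; types++ } { print NR, types }' <words >heaps.dat
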
    6. Run each of the tokenised documents through the part-of-speech tagger; we just want the tags.
      your-program <book |
      java -jar qtag.jar qtag-eng -f tab /dev/stdin |
      awk '{print $2}' >book.pos
      
      For each book, find, display, and comment on, the 30 most common part-of-speech tags. The question I'm interested in here is "each book has its own vocabulary; does it look as though each book also has its own (sub)grammar?"
    7. Do the part of speech tags follow Zipf's law? What is it about the set of tags that means we should expect that?
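
      For items 6 and 7, the same counting pattern as before works on the tag stream (a sketch, using book.pos as produced above):

      sort <book.pos | uniq -c | sort -nr |
      awk '{ print NR, $1, $2 }' >pos-zipf.dat

      Each line is then "rank count tag"; the first 30 lines answer item 6, and the rank and count columns give you the plot for item 7.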