COSC431

Information Retrieval

XML and Regular Expressions

XML

You will find out about XML at the World Wide Web consortium web site: http://www.w3.org. To find XML specifications, look at the XML Core page. For our purposes, the key specification is Extensible Markup Language (XML) 1.0.

Basically, XML addresses two problems:

  • Character encoding
  • Document structure

The character encoding problem has two parts. The first is that there are literally hundreds of different character sets, many still in use. The Internet register of character sets lists 850 names for 254 character sets, and it is far from complete. More precisely, what it names is not character sets but encodings: UTF-32, UTF-16, UTF-8, SCSU (Simple Compression Scheme for Unicode), and BOCU (Binary Ordered Compression for Unicode) are all ways of encoding sequences of (21-bit) Unicode code-points as streams of bytes. For example, the single code point U+20AC (the euro sign) comes out as the bytes E2 82 AC in UTF-8, 20 AC in big-endian UTF-16, and 00 00 20 AC in big-endian UTF-32: same character, three different byte streams. The basic issue is that you cannot read the words until you can read the letters, and you can't read the letters without knowing the encoding.

How can you know what the encoding of a document is?

  • System-wide convention
    For example, in UNIX I typically have
    export LC_CTYPE=en_NZ.ISO8859-1
    which says that my (LoCale)'s Character TYPE rules are those for the dialect of English written in New Zealand using member number 1 of the ISO 8859 family of 8-bit coded character sets.
    On some other systems we might find
    export LANG=en_GB.UTF-8
    which says the version of English written in Great Britain, encoded in UTF-8. All of this is fine for files you have written, or that other people have written using the same conventions. It won't do for documents from other sources: they might be fetched over the Internet, and even users sharing a file system might have different preferences.

  • File-system metadata
    The file system might keep “which encoding” as a property of a file, just as the Burroughs MCP operating system did back in the 1960s. Windows NTFS, the Mac OS X file system, and dear old BeOS can do this. Some other file systems cannot.

  • Out-of-band information
    When you fetch a document from a web server, it should normally say what the character set is. HTTP 1.1 says that a document is returned with a header giving a bunch of handy information. One piece of information is
    Content-Type: type; charset=character-set

    where type is something like text/html, and character-set is something like ISO-8859-1, which is the default a client must assume when none is specified. If that information is there, wonderful. If not, you can't really rely on the encoding being ISO Latin 1, because the server might not know, or (if a Microsoft server) might lie (Windows 1252 is often misreported as Latin 1).

    Of course, once you save the file to disc, this information may be lost.

  • In-band information
    Sadly, the only really reliable way to keep track of how a document is encoded is to put the information in the document itself. In the bad old days, programmers would often ship programs with a special
    /* 0123456789.-+ABCDEFGHIJKLMNOPQRSTUVWXYZ_&=$ */

    line listing all the characters that might be used in the file (here, the characters used in Fortran 66, plus dollar) in a known order, so that if you saw different characters in their places, you knew what translation table to build.

    These days, XML files begin with an XML declaration that says which version of XML is in use (there are two so far, XML 1.0 (5th edition) and XML 1.1 (2nd edition); XML 1.1's major change is going to be folded back into XML 1.0 5th edition). The XML declaration also says what the encoding is. For example,

    <?xml version="1.0" encoding="ASCII"?>

    Of course, in order to read the encoding declaration, you have to be able to read the characters, which seems to mean that you have to know what the encoding is; and if you know that, why do you need the encoding declaration? The answer is that there is a limited number of ways that an XML file can begin. The first character should normally be either a Unicode “Byte Order Mark” or a “<”. While there are hundreds of 8-bit character sets, almost all of them are extensions or revisions of either ASCII or EBCDIC, and the code for the “<” character will tell you which of those it is. The encoding declaration is written in the ASCII subset of Unicode (using only characters that are also in EBCDIC). Appendix F of the XML specification says how to auto-detect enough of the encoding to read the rest.
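
To make the auto-detection idea concrete, here is a minimal sketch in C. It is not the full algorithm from Appendix F, and guess_encoding is a made-up name for illustration; it looks only for a byte order mark or the first “<”.

#include <stddef.h>

/* Guess just enough about the encoding to read the XML declaration.
   b points at the first n bytes of the document. */
const char *guess_encoding(const unsigned char *b, size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";                 /* UTF-8 byte order mark */
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";              /* big-endian byte order mark */
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";              /* little-endian byte order mark */
    if (n >= 2 && b[0] == 0x00 && b[1] == '<')
        return "UTF-16BE";              /* '<' with no byte order mark */
    if (n >= 2 && b[0] == '<' && b[1] == 0x00)
        return "UTF-16LE";
    if (n >= 1 && b[0] == 0x4C)         /* '<' in EBCDIC is 0x4C */
        return "some EBCDIC extension";
    return "some ASCII extension";      /* read the declaration to be sure */
}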

For your assignments, you may assume UTF-8. In fact the Wall Street Journal collection is entirely ASCII.

The second way that XML helps with character encoding issues is by providing a way to write any Unicode character, no matter what character set you are using for the document. This means that documents can be translated from any character set to any other; characters with no representation in the target character set can be written as character references instead.

One method is to use names. There are five standard names in XML:

amp    &    the ampersand
apos   '    the apostrophe
gt     >    greater than
lt     <    less than
quot   "    the double quotation mark

Applications of XML may declare other named characters in things called DTDs (Document Type Definitions). XHTML defines hundreds, such as &uarr; for the up arrow, ↑.

Less conveniently, but more generally, characters may be written in decimal “&#ddd;” or hexadecimal “&#xhhh;”.

Any program that reads XML must be prepared to decode the five standard named characters and decimal and hexadecimal character references.
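
As a sketch of the decoding step, assume the text between the “&” and the “;” has already been isolated; decode_reference is a made-up helper name, not a standard function.

#include <stdlib.h>
#include <string.h>

/* Turn the body of a reference ("amp", "#38", "#x26", ...) into a
   Unicode code point, or -1 if it is not recognised. */
long decode_reference(const char *name)
{
    char *end;
    long cp;

    if (strcmp(name, "amp")  == 0) return '&';
    if (strcmp(name, "apos") == 0) return '\'';
    if (strcmp(name, "gt")   == 0) return '>';
    if (strcmp(name, "lt")   == 0) return '<';
    if (strcmp(name, "quot") == 0) return '"';
    if (name[0] == '#') {
        if (name[1] == 'x' || name[1] == 'X')
            cp = strtol(name + 2, &end, 16);    /* &#xhhh; */
        else
            cp = strtol(name + 1, &end, 10);    /* &#ddd;  */
        if (*end == '\0' && cp > 0 && cp <= 0x10FFFF) return cp;
    }
    return -1;    /* an application-defined name: would need the DTD */
}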

The other thing XML deals with is document structure. An XML document is basically a tree, where some leaves are text (with named character references, numeric character references, and perhaps even comments embedded), and nodes (elements) have labels (historically called Generic Identifiers, sometimes called element type names) and unordered attribute=value sets. Bending the idea of a tree a bit, there may also be cross-links. A declaration

<!ATTLIST element attribute ID #IMPLIED>

says that any instance of element may but need not have the given attribute, and that that attribute is a unique identifier (ID) for that instance. All the ID values in a document must be different, and they must look like words. A declaration

<!ATTLIST element attribute IDREF #IMPLIED>

declares an identifier reference attribute; its value must be one of the IDs in a document. For example, we might have

<!ELEMENT person (#PCDATA)>
<!ATTLIST person id ID #REQUIRED
          spouse IDREF #IMPLIED>
...
<person id="joe99" spouse="mary12">Mengele, Joseph</person>
<person id="mry12" spouse="joe19">Mallon, Mary</person>
<person id="rod.6">Borgia, Rodrigo</person>

There are two basic ways of processing XML.

  • In the DOM (Document Object Model) approach, you are given the whole document in memory at once, as a collection of objects linked together by pointers. This comes from Netscape, where it was the underpinning of JavaScript and Dynamic HTML, but it is now a W3C recommendation.

    This is a very simple way to deal with XML, or would be if the interface weren't barking mad.

  • In the SAX (Simple API for XML) approach, you are given a series of events (enter an element of this type, leave an element of this type, here is some text, here is a comment, &c). Typically this is done by some kind of “callback” machinery. This approach is very well suited for handling extremely large volumes of data, especially where you are not interested in most of it, but it may require you to keep track of information that you'd expect the parser to keep track of.

    Many XML parsers will give you a stream of events as a text file in ESIS (Element Structure Information Set) format. My qh program does this, for example.

The Wall Street Journal collection is made of files like this:

<DOC>
 <DOCNO>
  document number
 </DOCNO>
 <DOCID>
  document identifier
 </DOCID>
 ... other stuff ...
</DOC>

You want to remember the document number (or identifier; check the assignment!) for reporting; you want to look for words in the “... other stuff ...”.
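
To make the SAX approach concrete for this collection, here is a sketch in C. The handler type and the simulated event stream are invented for illustration; a real parser would fire the callbacks itself while reading the file.

#include <stdio.h>
#include <string.h>

typedef struct {
    void (*start_element)(const char *name);
    void (*end_element)(const char *name);
    void (*text)(const char *data);
} sax_handler;

static int in_docno = 0;    /* state the parser won't keep for us */

static void on_start(const char *name)
{
    if (strcmp(name, "DOCNO") == 0) in_docno = 1;
}

static void on_end(const char *name)
{
    if (strcmp(name, "DOCNO") == 0) in_docno = 0;
}

static void on_text(const char *data)
{
    if (in_docno) printf("document number: %s\n", data);
}

int main(void)
{
    sax_handler h = { on_start, on_end, on_text };

    /* Simulate the events a parser would deliver for one document. */
    h.start_element("DOC");
    h.start_element("DOCNO");
    h.text(" WSJ870324-0001 ");    /* a made-up document number */
    h.end_element("DOCNO");
    h.end_element("DOC");
    return 0;
}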

Regular Expressions

If you want to process text, you have to learn about regular expressions. If you want to be an effective UNIX or Java programmer, you have to learn about regular expressions. On a UNIX system, you should read the lex(1) and/or flex(1) manual pages, also regcmp(3) -- traditional UNIX regular expression functions -- and regcomp(3) -- POSIX.2 regular expression functions -- and of course grep(1). For Java, it's all there in the Java API JavaDocs. There is a freely available C library providing extended regular expressions; it's called PCRE, for Perl-Compatible Regular Expressions.

In the lecture I describe the syntax of basic regular expressions and how to map them directly onto C provided they satisfy a one-character lookahead condition. Here I am going to be a little bit more formal about it. I'm going to define Empty(e) --- can e match the empty string --- and First(e) --- the set of characters that any non-empty match for e must begin with.

For the C translations, I shall assume the following declarations:

int c;  /* unsigned character or EOF */
int get(void);  /* return next character, or EOF if none */
void fail(void); /* report match failure */

The C translation of a regular expression will start with the current character (or EOF if there is none) in c.

  • There is a regular expression primitive that matches the empty string. It's often written as an epsilon. In UNIX regular expressions, when allowed, it is written as an empty expression, that is, as nothing. Here I shall use 0.

    Empty(0) = true.

    First(0) = {}, the empty set of characters.

    C code: no code required.

  • If k is an ordinary character, then k is a regular expression that matches the string containing k and nothing else. In UNIX regular expressions, certain punctuation marks, ( ) [ ] { } | ^ $ ? + * . \ need to be protected by a backslash; to match the character “?” you need to write “\?”. Confusingly, the set of characters needing this protection varies from tool to tool. In some programs, you must protect ( ) this way, in others, you must not.

    Empty(k) = false.

    First(k) = {k}.

    C code:

            if (c != 'k') fail();
            c = get();
        
  • UNIX regular expressions let you match any one of a set of characters. A single dot matches any character (some tools) or any character except a newline character (other tools, notably lex(1)).

    Empty(.) = false.

    First(.) = the whole character set

    C code:

            if (c == EOF) fail();
            c = get();
        
    You may also provide a set of characters between square brackets.

    Empty([k1,...,kn]) = false.

    First([k1,...,kn]) = {k1,...,kn}.

    C code:

            if (c != k1 && ... && c != kn) fail();
            c = get();
        

    The classification functions described in 'man ctype' are also available; [[:alpha:]] is the set of letters, for example.

    You can also ask for any character that is not in a list of characters by placing ^ (a circumflex) at the beginning of the list.

    Empty([^k1,...,kn]) = false.

    First([^k1,...,kn]) = the whole character set \ {k1,...,kn}.

    C code:

            if (c == k1 || ... || c == kn) fail();
            if (c == EOF) fail(); /* must match a real character */
            c = get();
        
  • To match a sequence of things, we juxtapose the regular expressions. (Sometimes people use an infix dot for this.)

    Empty(p1p2) = Empty(p1) & Empty(p2).

    First(p1p2) = if Empty(p1) then First(p1) U First(p2) else First(p1).

    This clearly generalises to any number of patterns.

    C code (where if Empty(p1), we require that First(p1) and First(p2) have no element in common):

            code for p1
            code for p2
        
  • To match any one of several things, we use the “alternation” operator “|”. In terms of operator precedence, this binds less tightly than juxtaposition, just as addition binds less tightly than multiplication.

    Empty((p1|p2)) = Empty(p1) | Empty(p2).

    First((p1|p2)) = First(p1) U First(p2).

    This clearly generalises to any number of patterns.

    The C translation I'm telling you about has a one-character lookahead requirement. That is, whenever we have a choice, we must be able to tell just by looking at one next character. Regular expressions as such do not have any such requirement; it is only the “direct” translation technique I'm describing that has this limitation. If we were talking about context-free grammars, this would be the LL(1) condition.

    C code, provided not Empty(p1) and not Empty(p2) and First(p1) has no elements in common with First(p2):

            if (p1 can begin with c) {
                code for p1
    	} else
    	if (p2 can begin with c) {
    	    code for p2
    	} else {
    	    fail();
    	}
        
  • There is a special case of alternation/choice which is common enough to warrant its own notation, and that is an optional pattern. (p|0) may be written p?. From the previous definitions, we must have

    Empty(p?) = true.

    First(p?) = First(p).

    C code:

            if (p can begin with c) {
                code for p
    	}
        
  • Repeating a pattern a fixed number of times is equivalent to concatenating it with itself. This is commonly written p^n (p with a superscript n). Assuming n is greater than one,

    Empty(p^n) = Empty(p).

    First(p^n) = First(p).

    C code (same conditions as p?):

            for (i = 0; i < n; i++) {
                code for p
    	}
        
  • Repeating a pattern zero or more times has a special name, “Kleene star”. It is written p*.

    Empty(p*) = true.

    First(p*) = First(p).

    C code (same conditions as p?):

            while (c in First(p)) {
                code for p
    	}
        
  • Repeating a pattern one or more times also has a special notation. It is the plus sign: p+.

    Empty(p+) = Empty(p).

    First(p+) = First(p).

    C code (same conditions as p?):

    	do {
                code for p
    	} while (c in First(p));
        
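To see these rules working together, here is the direct translation of [0-9]+(\.[0-9]+)?, a made-up “number” pattern, using the declarations (c, get, fail) from above. The one-character lookahead condition holds: “.” is not a digit, so after the first [0-9]+ one look at c tells us whether the optional fraction is present.

#include <ctype.h>

extern int c;            /* current character, as declared above */
extern int get(void);
extern void fail(void);

void match_number(void)
{
    if (!isdigit(c)) fail();          /* [0-9]+ : at least one digit */
    c = get();
    while (isdigit(c)) c = get();     /* ... then as many more as we see */
    if (c == '.') {                   /* the optional (\.[0-9]+)? part */
        c = get();
        if (!isdigit(c)) fail();
        c = get();
        while (isdigit(c)) c = get();
    }
}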

An example

Suppose we want to handle simple XML markup. (This is really simple, OK? It is just an example. But oops, it is all that the WSJ needs...) Simple XML is a sequence of

  • start tags (<[a-zA-Z][a-zA-Z0-9]*>),
  • end tags (</[a-zA-Z][a-zA-Z0-9]*>),
  • empty tags (<[a-zA-Z][a-zA-Z0-9]*/>),
  • escaped less than characters (&lt;),
  • escaped ampersands (&amp;),
  • any other character.

To strip out markup, we want to
  • replace tags by newline characters,
  • replace &lt; by '<',
  • replace &amp; by '&', and
  • copy any other character unchanged.

To satisfy the one-character lookahead condition, we have to express this as
    (<(/[a-zA-Z][a-zA-Z0-9]*>
      |[a-zA-Z][a-zA-Z0-9]*/?>
      )
    |&(lt;|amp;)
    |[^<&]
    )*

This translates to the following C code:

c = get(); /* don't forget this! */
while (c != EOF) {
    if (c == '<') {
	c = get();
	if (c == '/') {
	    c = get();
	    if (!isalpha(c)) fail();
	    c = get();
	    while (isalnum(c)) c = get();
	    if (c != '>') fail();
	    c = get();
	} else {
	    if (!isalpha(c)) fail();
	    c = get();
	    while (isalnum(c)) c = get();
	    if (c == '/') c = get();
	    if (c != '>') fail();
	    c = get();
	}
	putchar('\n');
    } else
    if (c == '&') {
	c = get();
	if (c == 'a') {
	    c = get();
	    if (c != 'm') fail();
	    c = get();
	    if (c != 'p') fail();
	    c = get();
	    if (c != ';') fail();
	    c = get();
	    putchar('&');
	} else
	if (c == 'l') {
	    c = get();
	    if (c != 't') fail();
	    c = get();
	    if (c != ';') fail();
	    c = get();
	    putchar('<');
	} else {
	    fail();
	}
    } else {
	putchar(c);
	c = get();
    }
}

Here the putchar() lines are the actions we want to perform; everything else follows from the direct translation.

Of course we can do this in less hand-written code using a tool that understands regular expressions. For example,

%%
"<"[a-zA-Z][a-zA-Z0-9]*">"  |
"<"[a-zA-Z][a-zA-Z0-9]*"/>" |
"</"[a-zA-Z][a-zA-Z0-9]*">" { putchar('\n'); }
"&lt;"                      { putchar('<'); }
"&amp;"                     { putchar('&'); }
.                           { putchar(yytext[0]); }

is a complete lex(1) program for this problem. It's pretty dumb, but it's good enough for the body of this web page. (It can't handle the DOCTYPE line, the comment at the top, or the META tag, which has parameters.)
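
To try it: save the rules to a file, say strip.l (a name picked here for illustration), and build with flex, whose library supplies a default main; with the original lex, link with -ll instead of -lfl.

flex strip.l
cc lex.yy.c -lfl -o strip
./strip < somefile.xml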

Of course, any amount of white space is allowed before the closing '>' of a tag, and tag names may include underscores, hyphens, and dots, so a better version would be

%%
"<""/"?[a-zA-Z][-_.a-zA-Z0-9]*[ \t\n]*"/"?">" { putchar('\n'); }
"&lt;"  { putchar('<'); }
"&amp;" { putchar('&'); }
.       { putchar(yytext[0]); }

Changing the lex code is much easier than changing the C code above (but you should try it as an exercise). Even when changing the C code, it is handy to have the regular expression for reference.

Beware: this is good enough for the body of this web page, but that is all. For serious work, you'll have to extend it.

Who is in charge?

The theory of turning unrestricted regular expressions into code works through several stages:

  1. given a regular expression,
  2. convert it to a Non-deterministic Finite-state Automaton (NFA),
  3. then convert that to a Deterministic Finite-state Automaton (DFA),
  4. and finally turn that into tables and code.

The point of a DFA is that the code is very simple:

state = INITIAL_STATE;
do {
    c = get();
    switch (state = new_state[state][c]) {
        case SOME_STATE: user_action(); break;
        ...
    }
} while (state != FINAL_STATE);

There is an important distinction here concerning who is in charge. In the translation into ordinary code offered above, that code is in charge of reading characters: characters are pulled into the regular-expression code, and the state of the automaton is, in effect, the program counter. In many applications it is easier for the system as a whole if the characters are pushed into the regular-expression code, and then the state of the automaton has to be made explicit as a variable.

Consider breaking a file into words very very crudely. We are going to say that a word is a sequence of letters that is as long as possible. The regular expression is

    ([[:alpha:]]+|[^[:alpha:]])*

In pull mode, we'd write

    c = get();
    while (c != EOF) {
        if (isalpha(c)) {
            begin_word();
            do {
		add_to_word(c);
		c = get();
	    } while (isalpha(c));
	    end_word();
	} else {
	    c = get();
	}
    }

In push mode, we'd have to write

enum {NOT_IN_WORD, IN_WORD} state = NOT_IN_WORD;

void accept(int c) {
    switch (state) {
        case NOT_IN_WORD:
            if (isalpha(c)) {
                begin_word();
                add_to_word(c);
                state = IN_WORD;
	    } else {
	        state = NOT_IN_WORD; // optimise away!
	    }
	    break;
	case IN_WORD:
	    if (isalpha(c)) {
	        add_to_word(c);
	        state = IN_WORD; // optimise away!
	    } else {
	        end_word();
	        state = NOT_IN_WORD;
	    }
	    break;
    }
}

In your situation, it is easiest to use pull mode for recognising and discarding XML markup, and push mode for breaking the remaining text into words. One caution about push mode: at end of input the automaton may still be IN_WORD, so you must flush the last word yourself, either by calling end_word() or by pushing one final non-letter, as in the sketch below.
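
Here is a sketch of how the two modes fit together. It reuses c, get(), and accept() from above, but the tag handling is deliberately cruder than the full translation earlier: it does not validate tag syntax or decode entities.

extern int c;                      /* from the pull-mode code above */
extern int get(void);
extern void accept(int c);         /* from the push-mode code above */

void scan(void)
{
    c = get();
    while (c != EOF) {
        if (c == '<') {            /* pull mode: swallow the whole tag */
            do c = get(); while (c != EOF && c != '>');
            if (c == '>') c = get();
            accept('\n');          /* a tag separates words */
        } else {
            accept(c);             /* push mode: feed the word automaton */
            c = get();
        }
    }
    accept(' ');                   /* flush a final word, if any */
}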

Resources

Any Unix system should have lex(1) installed, making it easy to write text scanning code in C. flex(1) is an open source reimplementation of lex with extra features. In many systems lex is just a link to flex. JFlex is an open source scanner generator for Java, modelled on flex.