XML and Regular Expressions

XML

You will find out about XML at the World Wide Web consortium web site: http://www.w3.org. To find XML specifications, look at the XML Core page. For our purposes, the key specification is Extensible Markup Language (XML) 1.0.

Basically, XML addresses two problems: character encoding and document structure.

The character encoding problem has two parts. The first is that there are literally hundreds of different character sets, many still in use. The Internet register of character sets lists 850 names for 254 character sets, and it is far from complete. More precisely, what it names is not character sets, but encodings: UTF-32, UTF-16, UTF-8, SCSU (Simple Compression Scheme for Unicode), and BOCU (Binary Ordered Compression for Unicode) are all ways of encoding sequences of (21-bit) Unicode code-points as streams of bytes. The basic issue is that you cannot read the words until you can read the letters, and you can't read the letters without knowing the encoding.

How can you know what the encoding of a document is?

System-wide convention
For example, in UNIX I typically have
setenv LC_CTYPE en_NZ.ISO8859-1

which says that my (LoCale)'s Character TYPE rules are those for the dialect of en(glish) written in New Zealand using member number 1 of the ISO 8859 family of 8-bit coded character sets. On my new Macintosh we find

setenv LANG en_GB.UTF-8

which says the version of English written in Great Britain encoded in UTF-8. All of this is fine for files you have written, or other people have written using the same conventions. It won't do if you have documents from other sources. They might be fetched over the internet, but when you have users sharing a file system, they might have different preferences.

File-system metadata
The file system might keep "which encoding" as a property of a file, just as the Burroughs MCP operating system did back in the 1960s. The Windows NTFS, the MacOS X file system, and dear old BeOS can do this. Some other file systems cannot.
Out-of-band information
When you fetch a document from a web server, it should normally say what the character set is. HTTP 1.1 says that a document is returned with a header giving a bunch of handy information. One piece of information is
Content-Type: type; charset=character-set

where type is something like text/html, and character-set is something like ISO-8859-1, which is the default that a client must assume if it is not specified. If that information is there, wonderful. If not, you can't really rely on it being ISO Latin 1, because the server might not know or (if a Microsoft server) might lie (Windows 1252 is often misreported as Latin 1).

Of course, once you save the file to disc, this information may be lost.

In-band information
Sadly, the only really reliable way to keep track of how a document is encoded is to put the information in the document itself. In the bad old days, programmers would often ship programs with a special
/* 0123456789.-+ABCDEFGHIJKLMNOPQRSTUVWXYZ_&=$ */

line listing all the characters that might be used in the file (here, the characters used in Fortran 66, plus dollar) in a known order, so that if you didn't see those characters, you knew what translation table to build.

These days, XML files begin with an XML declaration that says which version of XML is in use (there are two so far: XML 1.0 (4th edition) and XML 1.1 (2nd edition); XML 1.1's major change is going to be folded back into XML 1.0 fifth edition). The XML declaration also says what the encoding is. For example,

<?xml version="1.0" encoding="ASCII"?>

Of course, in order to read the encoding declaration, you have to be able to read the characters, which seems to mean that you have to know what the encoding is, and if you know that, why do you need the encoding declaration? The answer is that there is a limited number of ways that an XML file can begin. The first character should normally be either a Unicode "Byte Order Mark" or a "<". While there are hundreds of 8-bit character sets, almost all of them are extensions or revisions of either ASCII or EBCDIC, and the "<" character code will tell you which of those it is. The encoding declaration is written in the ASCII subset of Unicode (using only characters that are also in EBCDIC). Annex F of the XML specification says how to auto-detect enough of the encoding to read the rest.
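That auto-detection can be sketched as a few byte comparisons. The following is a rough sketch only, not the full Annex F algorithm (it ignores the UCS-4 cases, among others), and the function name is my own:

```c
#include <string.h>

/* Guess enough of the encoding family from the first four bytes of a
   document to be able to read the encoding declaration itself. */
const char *guess_encoding(const unsigned char b[4]) {
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8 (BOM)";
    if (b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16, big-endian";
    if (b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16, little-endian";
    if (memcmp(b, "<?xm", 4) == 0)
        return "ASCII-compatible 8-bit";
    if (b[0] == 0x4C && b[1] == 0x6F && b[2] == 0xA7 && b[3] == 0x94)
        return "EBCDIC family";       /* "<?xm" in EBCDIC code */
    return "unknown; assume UTF-8";
}
```

Once the family is known, the parser can read the encoding declaration and switch to the exact encoding it names.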

For your assignments, you may assume UTF-8.

The second way that XML helps with character encoding issues is by providing a way to write any Unicode character, no matter what character set you are using for the document. This means that documents can be translated from any character set to any other; characters missing from the target character set can be written as character references instead.

One method is to use names. There are five standard names in XML:
amp    &    the ampersand
apos   '    the apostrophe
gt     >    greater than
lt     <    less than
quot   "    the double quotation mark

Applications of XML may declare other named characters in things called DTDs (Document Type Definitions). XHTML defines hundreds, such as &uarr; for the up arrow, ↑.

Less conveniently, but more generally, characters may be written in decimal "&#ddd;" or hexadecimal "&#xhhh;".

Any program that reads XML must be prepared to decode the five standard named characters and decimal and hexadecimal character references.
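The numeric forms can be decoded in a few lines following the lecture's c/get/fail conventions. A sketch (the function name is mine, and the string-backed get() and fail() are stand-ins so that it is self-contained); it assumes "&#" has already been consumed and c holds the next character, and it leaves c holding the character after the ";":

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src;   /* string-backed input, standing in for a file */
static int c;             /* current character, or EOF */

static int get(void) { return *src ? (unsigned char)*src++ : EOF; }
static void fail(void) { fprintf(stderr, "bad character reference\n"); exit(1); }

/* Decode "&#ddd;" or "&#xhhh;" and return the Unicode code-point. */
static int char_ref(void) {
    int code = 0;
    if (c == 'x') {                        /* hexadecimal form */
        c = get();
        if (!isxdigit(c)) fail();
        while (isxdigit(c)) {
            code = code * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
            c = get();
        }
    } else {                               /* decimal form */
        if (!isdigit(c)) fail();
        while (isdigit(c)) {
            code = code * 10 + (c - '0');
            c = get();
        }
    }
    if (c != ';') fail();
    c = get();
    return code;
}
```

Note the one-character-lookahead discipline: on entry c holds the first unprocessed character, and on exit it holds the first character after the reference.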

The other thing XML deals with is document structure. An XML document is basically a tree: some leaves are text (possibly with named characters, numeric character references, and even comments embedded), and the interior nodes (elements) have labels (historically called Generic Identifiers, sometimes called element type names) and unordered attribute=value sets. Bending the idea of a tree a bit, there may also be cross-links. A declaration

<!ATTLIST element attribute ID #IMPLIED>

says that any instance of element may but need not have the given attribute, and that that attribute is a unique identifier (ID) for that instance. All the ID values in a document must be different, and they must look like words. A declaration

<!ATTLIST element attribute IDREF #IMPLIED>

declares an identifier reference attribute; its value must be one of the IDs in a document. For example, we might have

<!ELEMENT person (#PCDATA)>
<!ATTLIST person id     ID    #REQUIRED
                 spouse IDREF #IMPLIED>
...
<person id="joe99" spouse="mry12">Mengele, Joseph</person>
<person id="mry12" spouse="joe99">Mallon, Mary</person>
<person id="rod.6">Borgia, Rodrigo</person>
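The IDREF constraint can be checked mechanically once the values are extracted. A toy sketch (the function name and interface are mine, not part of any XML library): it does a linear search checking that every reference resolves, and does not check that the IDs themselves are distinct.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if every IDREF value equals some declared ID value, else 0. */
int idrefs_ok(const char *ids[], size_t nids,
              const char *refs[], size_t nrefs) {
    for (size_t i = 0; i < nrefs; i++) {
        int found = 0;
        for (size_t j = 0; j < nids; j++) {
            if (strcmp(refs[i], ids[j]) == 0) { found = 1; break; }
        }
        if (!found) return 0;          /* dangling reference */
    }
    return 1;
}
```

A validating XML parser performs exactly this kind of check (typically with a hash table rather than a linear scan).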

There are two basic ways of processing XML: build the whole document tree in memory and walk it (as DOM parsers do), or react to each piece of markup as it is read (as event-driven parsers like SAX do).

The Wall Street Journal collection is made of files like this:

<DOC>
 <DOCNO>
  <i>document number</i>
 </DOCNO>
 <DOCID>
  <i>document identifier</i>
 </DOCID>
 ... other stuff ...
</DOC>

You want to remember the document number for reporting; you want to look for words in the "... other stuff ...".

Regular Expressions

If you want to process text, you have to learn about regular expressions. If you want to be an effective UNIX or Java programmer, you have to learn about regular expressions. On a UNIX system, you should read the lex(1) and/or flex(1) manual pages, also regcmp(3) -- traditional UNIX regular expression functions -- and regcomp(3) -- POSIX.2 regular expression functions -- and of course grep(1). For Java, it's all there in the Java API JavaDocs. There is a freely available C library providing extended regular expressions; it's called PCRE, for Perl-Compatible Regular Expressions.

In the lecture I describe the syntax of basic regular expressions and how to map them directly onto C provided they satisfy a one-character lookahead condition. Here I am going to be a little bit more formal about it. I'm going to define Empty(e) --- can e match the empty string --- and First(e) --- the set of characters that any non-empty match for e must begin with.
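Both functions go by recursion over the structure of the expression. A minimal sketch of Empty(e), using a syntax-tree representation of my own devising (First(e) follows the same shape: First of a single character is that character; First(e1|e2) is the union of the two Firsts; First(e1 e2) includes First(e2) only when Empty(e1) holds):

```c
#include <stdbool.h>

/* A regular expression as a syntax tree.  EPSILON is the empty string,
   CHR a single character, SEQ concatenation, ALT alternation ("|"),
   and STAR repetition ("*").  Illustrative only. */
enum op { EPSILON, CHR, SEQ, ALT, STAR };

struct regexp {
    enum op op;
    int ch;                            /* CHR only */
    const struct regexp *left;         /* SEQ, ALT, STAR */
    const struct regexp *right;        /* SEQ, ALT only */
};

/* Empty(e): can e match the empty string? */
bool Empty(const struct regexp *e) {
    switch (e->op) {
    case EPSILON: return true;
    case CHR:     return false;
    case SEQ:     return Empty(e->left) && Empty(e->right);
    case ALT:     return Empty(e->left) || Empty(e->right);
    case STAR:    return true;         /* zero repetitions always match */
    }
    return false;                      /* unreachable */
}
```

The one-character lookahead condition amounts to requiring that, wherever the translation must choose between alternatives, their First sets are disjoint.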

For the C translations, I shall assume the following declarations:

int c;  /* unsigned character or EOF */
int get(void);  /* return next character, or EOF if none */
void fail(void); /* report match failure */

The C translation of a regular expression will start with the current character (or EOF if there is none) in c.

An example

Suppose we want to handle simple XML markup. (This is really simple, OK? It is just an example.) Simple XML is a sequence of start tags like <name>, end tags like </name>, empty tags like <name/>, the character references &lt; and &amp;, and ordinary characters other than '<' and '&'.

To strip out markup, we want to copy ordinary characters through unchanged, decode the two character references, and replace each tag by a newline.

To satisfy the one-character lookahead condition, we have to express this as

    (<(/[a-zA-Z][a-zA-Z0-9]*>
      |[a-zA-Z][a-zA-Z0-9]*/?>
      )
    |&(lt;|amp;)
    |[^<&]
    )*

This translates to the following C code:

c = get(); /* don't forget this! */
while (c != EOF) {
    if (c == '<') {
	c = get();
	if (c == '/') {
	    c = get();
	    if (!isalpha(c)) fail();
	    c = get();
	    while (isalnum(c)) c = get();
	    if (c != '>') fail();
	    c = get();
	} else {
	    if (!isalpha(c)) fail();
	    c = get();
	    while (isalnum(c)) c = get();
	    if (c == '/') c = get();
	    if (c != '>') fail();
	    c = get();
	}
	putchar('\n');
    } else
    if (c == '&') {
	c = get();
	if (c == 'a') {
	    c = get();
	    if (c != 'm') fail();
	    c = get();
	    if (c != 'p') fail();
	    c = get();
	    if (c != ';') fail();
	    c = get();
	    putchar('&');
	} else
	if (c == 'l') {
	    c = get();
	    if (c != 't') fail();
	    c = get();
	    if (c != ';') fail();
	    c = get();
	    putchar('<');
	} else {
	    fail();
	}
    } else {
	putchar(c);
	c = get();
    }
}

Here the putchar calls are the actions we want to perform; everything else follows from the direct translation.

Of course we can do this in less hand-written code using a tool that understands regular expressions. For example,

%%
"<"[a-zA-Z][a-zA-Z0-9]*">"  |
"<"[a-zA-Z][a-zA-Z0-9]*"/>" |
"</"[a-zA-Z][a-zA-Z0-9]*">" { putchar('\n'); }
"&lt;"                      { putchar('<'); }
"&amp;"                     { putchar('&'); }
.                           { putchar(yytext[0]); }

is a complete lex(1) program for this problem. It's pretty dumb, but it's good enough for the body of this web page. (It can't handle the DOCTYPE line, the comment at the top, or the META tag, which has parameters.)

Of course, any amount of white space is allowed before the closing '>' of a tag, and tag names may include underscores, hyphens, and dots, so a better version would be

%%
"<""/"?[a-zA-Z][-_.a-zA-Z0-9]*[ \t\n]*"/"?">" { putchar('\n'); }
"&lt;"  { putchar('<'); }
"&amp;" { putchar('&'); }
.       { putchar(yytext[0]); }

Changing the lex code is much easier than changing the C code above (but you should try it as an exercise). Even when changing the C code, it is handy to have the regular expression for reference.

Beware: this is good enough for the body of this web page, but that is all. For serious work, you'll have to extend it.

Who is in charge?

The theory of turning unrestricted regular expressions into code works through several stages:

  1. given a regular expression,
  2. convert it to a Non-deterministic Finite-state Automaton (NFA),
  3. then convert that to a Deterministic Finite-state Automaton (DFA),
  4. and finally turn that into tables and code.

The point of a DFA is that the code is very simple:

state = INITIAL_STATE;
do {
    c = get();
    switch (state = new_state[state][c]) {
        case SOME_STATE: user_action(); break;
        ...
    }
} while (state != FINAL_STATE);

There is an important distinction here concerning who is in charge. In the translation into ordinary code I offered above, that code is in charge of reading characters: characters are pulled into the regexp code, and the state of the automaton basically is the program counter of the PC. In many applications, it is easier for the system as a whole if the characters are pushed into the regexp code, and the state of the automaton has to be explicit as a variable.

Consider breaking a file into words very very crudely. We are going to say that a word is a sequence of letters that is as long as possible. The regular expression is

    ([[:alpha:]]+|[^[:alpha:]])*

In pull mode, we'd write

    c = get();
    while (c != EOF) {
        if (isalpha(c)) {
            begin_word();
            do {
		add_to_word(c);
		c = get();
	    } while (isalpha(c));
	    end_word();
	} else {
	    c = get();
	}
    }

In push mode, we'd have to write

enum {NOT_IN_WORD, IN_WORD} state = NOT_IN_WORD;

void accept(int c) {
    switch (state) {
        case NOT_IN_WORD:
            if (isalpha(c)) {
                begin_word();
                add_to_word(c);
                state = IN_WORD;
	    } else {
	        state = NOT_IN_WORD; // optimise away!
	    }
	    break;
	case IN_WORD:
	    if (isalpha(c)) {
	        add_to_word(c);
	        state = IN_WORD; // optimise away!
	    } else {
	        end_word();
	        state = NOT_IN_WORD;
	    }
	    break;
    }
}

In your situation, it is easiest to use pull mode for recognising and discarding XML markup, and push mode for breaking the remaining text into words.
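Putting the two together, here is a minimal sketch (all names are mine, and it ignores attributes and character references, so it is nowhere near assignment quality): a pull loop consumes the markup, and pushes each remaining text character into a two-state word breaker.

```c
#include <ctype.h>
#include <stddef.h>

/* Push mode: a two-state word breaker.  Each alphabetic run becomes
   one word in out, terminated by '\n'. */
static void push_char(int ch, int *in_word, char *out, size_t *len) {
    if (isalpha(ch)) {
        out[(*len)++] = (char)ch;
        *in_word = 1;
    } else if (*in_word) {
        out[(*len)++] = '\n';                    /* end of the word */
        *in_word = 0;
    }
}

/* Pull mode: consume the markup wholesale, push the text characters. */
void strip_and_break(const char *src, char *out) {
    int in_word = 0;
    size_t len = 0;
    while (*src) {
        if (*src == '<') {
            while (*src && *src != '>') src++;   /* pull the whole tag */
            if (*src) src++;                     /* step past '>' */
            push_char(' ', &in_word, out, &len); /* a tag separates words */
        } else {
            push_char((unsigned char)*src++, &in_word, out, &len);
        }
    }
    push_char(' ', &in_word, out, &len);         /* flush a trailing word */
    out[len] = '\0';
}
```

The pull loop owns the input; the word breaker never reads a character itself, it only reacts to the ones pushed at it, which is exactly the division of labour suggested above.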