Words

Why we care about words

Information retrieval models documents and queries as collections of words, not because there is no more information to be had than that, but because it is very hard to get it. Natural language processing is coming on amazingly, but there are vast amounts of text to be indexed, and we just cannot afford much processing yet.

As you are about to see, even deciding what is a word is challenging.

What is a word?

Let's look at some example text. Turn to http://en.wikipedia.org/wiki/Relational_algebra.

Is "http://en.wikipedia.org/wiki/Relational_algebra" a word?

Is "Wikipedia" a word? Would you find it in a dictionary?

Is "first-order" a word? Or is it two words? When is a blackboard not a black board? Is "Blackboard" (see www.blackboard.com) about black boards, blackboards, or neither? When is a blackbird not a black bird? (Hint: there are two answers: "blackbirders" were not interested in feathered creatures.)

Is "θ-join" a word? Since when has "θ" been a letter in the English alphabet?

Is "equijoin" a word? What kind of word is it?

In "E.F. Codd's", "Codd" is a surname, so a word. Is "Codd's" a word? What about "E." and "F."?

Suppose there were a pair of sentences "No man admires his work more than I. Codd, however, disagrees." How do you know it's not one sentence talking about somebody called "I. Codd"?
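To see how much hangs on these decisions, here is a small sketch of two possible tokenisation policies. Both patterns and the names LETTERS_ONLY, LETTERS_PLUS, and tokens are made up for illustration; neither policy is being offered as the right answer, and real systems make many more distinctions.

```python
import re

# Two hypothetical tokenisation policies (assumptions, not a standard).
# [^\W\d_] means "a word character that is not a digit or underscore",
# i.e. a Unicode letter, so Greek letters such as θ count as letters.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")
# Same, but a hyphen, apostrophe, or full stop may join two runs of letters.
LETTERS_PLUS = re.compile(r"[^\W\d_]+(?:[-'.][^\W\d_]+)*")

def tokens(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    return pattern.findall(text)

sample = "E.F. Codd's first-order θ-join"
print(tokens(LETTERS_ONLY, sample))
# ['E', 'F', 'Codd', 's', 'first', 'order', 'θ', 'join']
print(tokens(LETTERS_PLUS, sample))
# ['E.F', "Codd's", 'first-order', 'θ-join']
```

The first policy turns "Codd's" into a surname plus a stray "s"; the second keeps it whole but also glues "E.F" together. Neither one can tell you whether "I. Codd" is a person.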

Is "ISBL" a word? (Similar questions about "UNESCO", "NZ".)

In the definition of Cartesian product, are cross, cup, and element words? If they are, why aren't curly braces, vertical bar, and comma words, or are they? Can you imagine someone looking for documents containing commas? Can you imagine someone looking for documents containing the element symbol?

Is σφ one word or two? Is it the same as σφ or different?

In the first row of the Employee table, we see "Harry 3415 Finance". Is 3415 a word? If in another table we see "3415 El Camino Real" as part of an address, is that the same word as Harry's employee id, or another? What if the address were "3415B El Camino Real"?

In the Car table, we see weirdos like "20'000" and "50'000". Are these things words? What do they mean? Why are they written in that bizarre fashion?

Is "$20,000" a word? Is it the same word as "$20000"? Is it the same word as "$20,000.00"? Is it the same word as "NZD 20000"?
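If you decide that these are all the same word, you need some rule that maps them to a common form. The following is only a sketch under loud assumptions: the pattern AMOUNT and the function normalise_amount are invented here, only "$" and "NZD" are recognised as currency markers, and the currency itself is thrown away (so it quietly assumes "$" and "NZD" mean the same thing, which is exactly the kind of question being asked above).

```python
import re
from decimal import Decimal

# Hypothetical pattern: optional "$" or "NZD", then digits that are either
# grouped in threes with "," or "'" separators or not grouped at all,
# then an optional decimal part.
AMOUNT = re.compile(r"^(?:\$|NZD\s*)?(\d{1,3}(?:[,']\d{3})+|\d+)(\.\d+)?$")

def normalise_amount(token):
    """Return the numeric value of token as a Decimal, or None if it does not fit."""
    m = AMOUNT.match(token)
    if m is None:
        return None
    digits = re.sub(r"[,']", "", m.group(1))   # drop the group separators
    return Decimal(digits + (m.group(2) or ""))

for t in ["$20,000", "$20000", "$20,000.00", "NZD 20000", "20'000", "3415B"]:
    print(t, "->", normalise_amount(t))
```

Decimal("20000.00") compares equal to Decimal("20000"), so the first five forms all collapse to one value, while "3415B" is rejected. Note that a plain "3415" would be accepted too: the rule cannot tell an employee id from a street number.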

Speaking of words with spaces in them, if you are a native speaker of English, et cetera, force majeure, tour de force, fait accompli and so on are almost certainly single entries in your personal lexicon. You may not, for example, know that tour means "turn" and force means "strength". You're not going to use the components of these phrases in the same free combinatorial way that you use the components of English phrases. Tour de vitesse, force moindre, and the like will not spring to your lips. So is tour de force one word or three?

Come to that, English has a number of stock phrases that don't follow modern English grammar: durance vile, court martial. (These have Noun Adjective order, whereas Modern English has Adjective Noun.) Are these one word or two?

That page has a sidebar (or is that "side bar" or "side-bar"?) header "languages". Since when has "Deutsch" been an English word? Is that side bar part of the content of the page or simply part of the framing? (No, this page does not use frames.)

If a web page contains an embedded style sheet or JavaScript, how can you tell whether that stuff contains words for indexing or not?
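One common answer is simply to refuse to index anything inside script and style elements. Here is a minimal sketch of that policy using Python's standard html.parser module; the class VisibleText and the function visible_text are names invented for this example, and the policy itself is an assumption (it will happily throw away a string literal in a script that the reader does eventually see).

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text chunks, skipping anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only when we are outside script/style.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = ("<html><head><style>p { color: red }</style>"
        "<script>var word = 'hidden';</script></head>"
        "<body><p>Relational algebra</p></body></html>")
print(visible_text(page))   # Relational algebra
```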

In fact the way to tell what is content in a Wikipedia page appears to be to look between <!-- start content --> and <!-- end content -->; how general is that?
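That rule is easy enough to code up, which is part of its charm. The sketch below does nothing cleverer than search for the two comment markers quoted above; the function name extract_content is made up, and the whole thing assumes the markers appear exactly as written, once each, in that order. Any page that does not follow the convention gets None back.

```python
START = "<!-- start content -->"
END = "<!-- end content -->"

def extract_content(html):
    """Return the text between the two markers, or None if either is missing."""
    begin = html.find(START)
    if begin == -1:
        return None
    begin += len(START)
    finish = html.find(END, begin)   # look for the end marker after the start
    if finish == -1:
        return None
    return html[begin:finish]

page = ("<div>navigation</div><!-- start content -->"
        "<p>Relational algebra</p><!-- end content --><div>footer</div>")
print(extract_content(page))                  # <p>Relational algebra</p>
print(extract_content("<p>no markers</p>"))   # None
```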

What do you do if some of the textual content of a page is presented inside images (as in this page)?

This article is about "databases". Is that the same as "data-bases" or "data bases"?

My name is "Richard". For many centuries a popular nickname for people with that name has been "Dick". So are "Richard" and "Dick" (which don't even start with the same letter) the same word? I don't like being called "Dick", not at all, but if someone does that, they probably don't mean to be rude. But with the addition of the letter "a", they would be being rude. So are "Dick" and "dick" the same word? Surely "My" and "my", "But" and "but", "The" and "the" are the same words?

That's an example of homophones. What's a "barn"? From the OED: "In nuclear physics, 10⁻²⁴ cm², a unit of area used in the measurement of the cross-section of a nucleus", originally from "as easy to hit as the side of a barn". You know what a "key" is. It unlocks things. No, wait. It's "a wharf or quay". No, wait, it's "a low island, sand-bank, or reef" as in the Florida Keys. So a key (sense 2) is the same as a "quay" and a key (sense 3) is the same as a "cay". How do you tell when "key" and "cay" are the same word? How do you tell when they are not?

Returning to our original example, what's the relationship between "relate", "relates", "related", "relating", "relater", "relaters", "relative", "relatives", "relatively", "relation", "relations", "relational", "relationally", "relationship", "relationships" and other such words?

Stemming

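Human languages are sometimes classified as isolating (words hardly change form), inflecting (words change form to mark things like number, case, and tense), or agglutinating (long words are built by gluing many morphemes together).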

Maori is very nearly a good example of an isolating language. There is a small number of prefixes that may be added to words, such as the causative prefix "whaka-", but only about six nouns (all relating to people) are ever marked for singular/plural, and verbs are not marked with information about subject or tense.

Latin, Greek, and Russian are good examples of inflecting languages. English used to be an inflecting language but has moved a fair distance in the direction of isolation.

Inuit languages are commonly cited as agglutinating. To an English speaker, German sometimes looks rather like that, but it isn't really.

Morphological analysis is the process of reversing the gluing-together and changing processes to convert a lexical word into a sequence of dictionary words. For example, "gamesmanship" can be broken into "((GAME+(-S))+MAN)+(-SHIP)" (the art of being a man who wins at games other than by being good at them). Here GAME and MAN are called "free morphemes" because they can occur as separate words, and -S and -SHIP are called "bound morphemes" because they can only occur attached to something else.

Full morphological analysis can be done moderately fast using two-level finite state automata combined with a morphological grammar that says what kinds of morpheme sequences are allowed. For example, "antiwar" makes sense but "waranti" doesn't, and "un((help)ful)" makes sense but there is no "unhelp". It also requires a reasonably complete dictionary. What's more, sometimes the results are uncertain because a word can be taken apart more than one way: is a lighthousekeeper a keeper of lighthouses or a housekeeper who is light?

Information Retrieval people want to deal with very large amounts of unrestricted text. This makes "reasonably complete dictionaries" hard to come by. Even the gigantic Oxford English Dictionary is not complete and never will be. Technical terms and brand names and speling misteaks will keep on coming. And even two-level finite state automata are not blindingly fast. So IR uses an approximation to morphological analysis called "stemming". What stemming actually involves may be different for different languages. In English it could be as simple as "remove a final S, if there is one". In other languages where a word can have hundreds or even thousands of forms, it would have to be more elaborate. For Maori it would be the identity transformation. In all cases, the idea is to get quickly from a word to a character sequence such that the commonest variants of the word will be mapped to the same string. It doesn't have to be right; it just has to be close enough to right to be useful.
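The two extremes just mentioned fit in a few lines. This is a sketch of the "remove a final S" rule exactly as stated, not of any real stemmer; the function names stem_english and stem_maori are made up, and the check that leaves a bare "s" alone is an extra assumption so that the result is never empty.

```python
def stem_english(word):
    """Crudest possible English stemmer: drop one final 's', if there is one."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word

def stem_maori(word):
    """Identity transformation: the word is its own stem."""
    return word

for w in ["relation", "relations", "relates", "related", "glass"]:
    print(w, "->", stem_english(w))
# relation -> relation
# relations -> relation
# relates -> relate
# related -> related
# glass -> glas
```

"relation" and "relations" land on the same string, which is the point. "related" is left alone and "glass" becomes "glas", which is wrong but harmless as long as queries are stemmed the same way as documents.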