Andrew Trotman and Mounia Lalmas
Proceedings of the INEX 2005 Workshop on Element Retrieval Methodology, pp. 1-3
With a wealth of
documents originating in markup languages such as XML, it is appropriate to ask
how this markup might be used in information retrieval. One answer is to change the focus of
retrieval from whole documents to document elements.
In
document-centric IR the user searches whole documents and is returned a ranked
list of documents that match their queries.
By contrast, in element retrieval document elements are returned –
perhaps a chapter of a book, or a section of an academic paper.
Since 2002 the annual INEX workshop [2] has been examining element ranking
algorithms for XML documents, most specifically for the IEEE collection of
12,107 documents. Arguably, progress has been made.
It is this “arguably” that has become the center of attention. At the outset it
would appear as though element retrieval is a simple derivation of document
retrieval, but experience at INEX has shown this to be far from the truth.
A document-centric search engine makes a binary decision about the relevance of
a given document: either it will appear in a result list or it will not. It
cannot “partly appear”.
An element-centric search engine, having decided a piece of text is relevant,
is faced with how to return that information. Perhaps only a
paragraph is relevant, or perhaps the sub-section, or the section, or it may be
the entire document. The same piece of
text can be returned in many different ways.
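
To make the choice of return unit concrete, the following is a minimal sketch, not a method used at INEX. It assumes a toy XML document whose tag names (article, sec, ss1, p) are purely illustrative, a hypothetical helper named candidate_units, and a stand-in relevance test based on substring matching. For each relevant paragraph it lists every enclosing element, from narrowest to widest, that a search engine could return for the same text.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the tag names are illustrative only.
DOC = (
    "<article><sec><ss1>"
    "<p>Background text.</p>"
    "<p>Element retrieval returns parts of documents.</p>"
    "</ss1></sec></article>"
)


def candidate_units(root, is_relevant):
    """For each relevant <p>, list the tags of all elements that contain it.

    Each inner list runs from the paragraph itself up to the document root,
    so every entry is a different way of returning the same relevant text.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    units = []
    for p in root.iter("p"):
        if not is_relevant(p.text or ""):
            continue
        chain = []
        node = p
        while node is not None:
            chain.append(node.tag)
            node = parent.get(node)  # None once the root has been passed
        units.append(chain)
    return units


root = ET.fromstring(DOC)
# Stand-in relevance test (an assumption): simple substring match.
units = candidate_units(root, lambda text: "Element retrieval" in text)
print(units)
```

In this sketch one relevant paragraph yields four candidate answers: the paragraph, the sub-section, the section and the whole article. Which of these to rank, and how to score overlapping answers, is exactly what the sketch leaves open.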
When humans are making relevance judgments, they too are faced with similar
problems. If a given paragraph is relevant, then surely
a containing section is also relevant.
How much more so, or less so?
Combining these,
how can the performance of a search engine be measured?
There are clearly
methodological issues in element retrieval, and these need addressing. It is these issues that are of interest at
this workshop.
For many, the most pressing issue is this: when there is no community-accepted
methodology, it is not possible to claim that any one system is better than any
other.