Dylan Jenkinson, Andrew Trotman
Ad hoc passage retrieval within the Wikipedia is examined in the context of INEX 2007. An analysis of the INEX 2006 assessments suggests that a fixed-size window of about 300 terms is consistently seen in the assessments and that retrieving such a window might be a good strategy.
In the runs submitted to INEX, potentially relevant documents were identified using BM25 (trained on INEX 2006 data). For each potentially relevant document, the location of every search term occurrence was identified and the center (mean) of those locations computed. A fixed-size window was then centered on this location. A method of removing outliers was also examined, in which all terms occurring more than one standard deviation from the center were considered outliers and the center was recomputed without them. Both techniques were examined with and without stemming.
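
The window placement described above can be sketched as follows. This is a minimal illustration, not the code used for the submitted runs. It assumes term locations are integer term offsets, uses the population standard deviation, and clamps the window to the document boundaries. The function names `window_center` and `fixed_window` are hypothetical, and the abstract does not specify these details.

```python
from statistics import mean, pstdev


def window_center(positions, remove_outliers=False):
    """Return the mean offset of the search-term occurrences.

    With remove_outliers=True, occurrences more than one standard
    deviation from the mean are dropped and the mean is recomputed.
    """
    if not positions:
        raise ValueError("document contains no search terms")
    center = mean(positions)
    if remove_outliers and len(positions) > 1:
        # Assumption: population standard deviation; the source does not say which.
        spread = pstdev(positions)
        kept = [p for p in positions if abs(p - center) <= spread]
        if kept:
            center = mean(kept)
    return center


def fixed_window(positions, doc_length, size=300, remove_outliers=False):
    """Return (start, end) term offsets of a fixed-size window around the center."""
    center = window_center(positions, remove_outliers)
    start = int(round(center - size / 2))
    # Assumption: shift the window so it stays inside the document.
    start = max(0, min(start, max(0, doc_length - size)))
    end = min(doc_length, start + size)
    return start, end


hits = [100, 110, 120, 130, 900]  # one stray occurrence near the end
print(fixed_window(hits, doc_length=1000))
print(fixed_window(hits, doc_length=1000, remove_outliers=True))
```

In this example the stray occurrence at offset 900 pulls the plain mean away from the cluster of matches; with outlier removal the window falls back onto the cluster.
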
For Wikipedia linking, we identified terms that were over-represented within the document and, from the top few, generated queries of different lengths. A BM25 ranking search engine was used to identify potentially relevant documents. Links from the source document to the potentially relevant documents (and back) were constructed (at the granularity of whole documents). The best-performing run used the 4 most over-represented search terms to retrieve 200 documents, and the next 4 to retrieve 50 more.
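
The query generation step for linking might look like the sketch below. The abstract does not say how over-representation was measured, so the scoring here is an assumption: the term's rate in the document divided by its add-one-smoothed rate in the collection. The names `over_represented_terms` and `staged_queries` are hypothetical, and the BM25 search itself is left out.

```python
from collections import Counter


def over_represented_terms(doc_terms, collection_freq, collection_size):
    """Rank a document's terms, most over-represented first.

    Assumed score: (rate in document) / (smoothed rate in collection).
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)

    def score(term):
        doc_rate = counts[term] / doc_len
        # Add-one smoothing so terms unseen in the collection do not divide by zero.
        coll_rate = (collection_freq.get(term, 0) + 1) / (collection_size + 1)
        return doc_rate / coll_rate

    # Ties are broken alphabetically to keep the ranking deterministic.
    return sorted(counts, key=lambda t: (-score(t), t))


def staged_queries(ranked_terms, stages=((4, 200), (4, 50))):
    """Split ranked terms into consecutive queries, each with a retrieval depth.

    The default mirrors the best run: the top 4 terms retrieve 200
    documents, and the next 4 retrieve 50 more.
    """
    queries, start = [], 0
    for n_terms, depth in stages:
        terms = ranked_terms[start:start + n_terms]
        if terms:
            queries.append((terms, depth))
        start += n_terms
    return queries


doc = ["inex", "inex", "the", "the", "the", "wiki"]
freq = {"the": 1000, "inex": 1, "wiki": 9}  # toy collection counts
ranked = over_represented_terms(doc, freq, collection_size=10000)
print(ranked)
print(staged_queries(ranked, stages=((2, 200), (1, 50))))
```

Each `(terms, depth)` pair would then be passed to the search engine, and the retrieved documents become link targets for the source document.
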