Andrew Trotman, Dylan Jenkinson
Information retrieval test sets consist of three parts: documents, topics, and
assessments. Assessments are time-consuming to generate. Even with pooling,
assessment for INEX 2006 took about 7 hours per topic.
Traditionally the assessment of a single topic is performed by a single human.
Herein we examine the consequences of using multiple assessors per topic.
A set of 15 topics was used. The mean topic pool contained 98 documents.
Between 3 and 5 separate assessors per topic assessed all documents in a pool.
One assessor was designated the baseline. All assessments were then used to
generate 10,000 synthetic multi-assessor assessment sets.
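The abstract does not say how the synthetic sets were constructed. As a minimal sketch of one plausible reading, assume each synthetic set is built by picking, independently for every topic, the judgments of one randomly chosen assessor. The function names, the nested-dictionary layout, and the sampling rule below are all illustrative assumptions, not the authors' procedure.

```python
import random

def synthetic_assessment_set(assessments, rng):
    """Build one synthetic set (ASSUMED sampling rule, not from the paper).

    assessments: {topic_id: {assessor_id: {doc_id: relevance}}}
    For each topic, the judgments of one randomly chosen assessor are copied.
    """
    synthetic = {}
    for topic_id, by_assessor in assessments.items():
        # sorted() makes the choice reproducible for a given seed
        chosen = rng.choice(sorted(by_assessor))
        synthetic[topic_id] = dict(by_assessor[chosen])
    return synthetic

def generate_synthetic_sets(assessments, n_sets=10_000, seed=0):
    """Generate n_sets synthetic multi-assessor assessment sets."""
    rng = random.Random(seed)
    return [synthetic_assessment_set(assessments, rng) for _ in range(n_sets)]
```

Under this assumption, a single synthetic set typically mixes judgments from several assessors across topics, which is what distinguishes it from the single-assessor baseline.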
The baseline relative rank order of all runs submitted to the INEX 2006
relevant-in-context task was compared to that of each synthetic. The mean
Spearman’s rank correlation coefficient was 0.986 and all coefficients were
above 0.95, so the correlation is very strong. Non-matching rank orders are
seen when the mean average precision difference between runs is less than
0.05. In the top 10 runs no significantly different runs were ranked in a
different order in more than 5% of the synthetics. Using multiple assessors
per topic is very unlikely to affect the outcome of an evaluation forum.
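To make the rank-order comparison concrete, here is a minimal sketch of Spearman's rank correlation computed between two lists of per-run scores (for example, scores under the baseline and under one synthetic set). It is the standard definition (Pearson correlation of the ranks, with tied scores given their average rank); the function names are illustrative and the code is not taken from the paper.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = best); ties share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j over the block of runs tied with the run at position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Assumes len(x) == len(y) and that neither list is constant
    (a constant list has zero rank variance and rho is undefined).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

For instance, swapping the bottom two of four runs, as in `spearman([4, 3, 2, 1], [4, 3, 1, 2])`, gives a coefficient of 0.8, while identical orderings give 1.0.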