Inter-assessor agreement at INEX 06

Nils Pharo, Andrew Trotman, Shlomo Geva, Benjamin Piwowarski

 

ABSTRACT

A basic requirement of IR test collection is to have a pool of documents with relevance assessments. Such relevance assessments are typically performed by topic experts. The procedure at INEX is having the participating groups submitting topics which are later returned to topic creators for relevance assessment. Usually the participants perform relevance assessments of their own topics, but some topics are assessed by more than one assessor.

 

At the INEX 06 workshop an experiment was conducted in order to measure interassessor-agreement. Since the INEX 06 collection consists of articles from Wikipedia it is relatively easy to design topics (simulated work tasks) of general interest, thus making it easier for other than topic creators to perform relevance judgments. Participants at the workshop were asked to perform relevance assessments of a selection of topics, which were logged in the online assessment system. The assessors also answered questionnaires related to the experiment. The main purpose of this experiment was to measure the level of agreement between many assessors judging the same topics. We present the findings and relate them to background factors.

 

The findings from our analysis can be used for several purposes: 1) they can help us design the procedures for relevance assessments in future INEX experiments; 2) they provide a valuable understanding of the general factors influencing relevance judgments used in test collections; and 3) they can be used to argue for the validity of using the test collection approach as a method for testing the retrieval efficiency of IR systems.

 

[Return to Andrew’s Home]