Brain Dump

Information Retrieval Evaluation

Tags
text-processing

The effectiveness of an information retrieval system is evaluated by the relevance of the documents it retrieves. Relevance is generally judged in a binary way (a document either is or isn't relevant) because grading countless degrees of relevance isn't practical.

Other possible factors are:

  • user effort / ease of use (complicated query bad)
  • response time
  • form of presentation

We test relevance in practical terms using a [see page 5, gold standard] data set with:

  • a standard set of documents and queries
  • a list of documents judged relevant for each query (by humans)
  • relevance scores (often binary)

Evaluating Precision and Recall

We define (see [see page 6, here] for the meaning of the variables):

  • Recall: $\frac{A}{A+C}$, the proportion of relevant documents returned. I got $A$ relevant documents but $C$ of them are missing.
  • Precision: $\frac{A}{A+B}$, the proportion of retrieved documents that are relevant. I got $A$ relevant documents and $B$ irrelevant ones.
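
A minimal sketch of these two formulas in Python, assuming binary relevance judgments; the function and the retrieved/relevant document-id sets are illustrative, not from the notes:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query with binary relevance.

    retrieved: set of document ids returned by the IR system
    relevant:  set of document ids judged relevant in the gold standard
    """
    a = len(retrieved & relevant)   # relevant documents that were retrieved
    b = len(retrieved - relevant)   # irrelevant documents that were retrieved
    c = len(relevant - retrieved)   # relevant documents that were missed
    precision = a / (a + b) if retrieved else 0.0
    recall = a / (a + c) if relevant else 0.0
    return precision, recall

# Example with made-up document ids:
# precision_recall({1, 2, 3, 4}, {2, 4, 5}) -> (0.5, 0.666...)
```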

Precision is often inversely related to recall, so we often use an $F$-score (the [see page 18, harmonic mean] of the two values) to get a single concrete evaluation for an IR system. This metric gives equal importance to precision and recall; if one of these is more important than the other, you can use the weighted metric $F_\beta$.
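
Concretely, writing $P$ for precision and $R$ for recall, the standard forms (consistent with the fractions above) are:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}, \qquad F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

Choosing $\beta > 1$ weights recall more heavily, while $\beta < 1$ weights precision more heavily.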

Evaluating Order of Relevance

Measures how well a method ranks relevant documents before non-relevant documents.

Generally we only inspect the first few pages of a search result, so even if a query has low precision or recall, the system is good as long as enough of the relevant documents are shown first.

To measure this we often consider the precision at some [see page 21, rank], $n$, in the ordered result set for an IR system: essentially the percentage of relevant documents among the first $n$ returned documents.
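
A rough sketch of precision at rank $n$ in Python, assuming the result list is already ordered by the system's ranking and the gold-standard relevant set is known (names and example ids are made up):

```python
def precision_at_n(ranked_results, relevant, n):
    """Precision@n: fraction of the first n retrieved documents that are relevant.

    ranked_results: list of document ids, best-ranked first
    relevant:       set of document ids judged relevant in the gold standard
    """
    top_n = ranked_results[:n]
    hits = sum(1 for doc in top_n if doc in relevant)
    return hits / n if n > 0 else 0.0

# Example: 3 of the top 5 results are relevant.
# precision_at_n([7, 2, 9, 4, 1, 8], {2, 4, 1, 30}, 5) -> 0.6
```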