Brain Dump

Information Retrieval Evaluation

Tags
text-processing

The effectiveness of an information retrieval system is evaluated by the relevance of the documents it retrieves. Relevance is generally judged in a binary way (a document either is or isn't relevant) because judging among countless degrees of relevance isn't practical.

Other possible factors are:

  • user effort / ease of use (complicated queries are bad)
  • response time
  • form of presentation

We test relevance in practical terms using a [see page 5, gold standard] data set with:

  • a standard set of documents and queries
  • a list of documents judged relevant for each query (by humans)
  • relevance scores (often binary)
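
A minimal sketch of what such a gold-standard collection might look like in code; the query and document identifiers here are made up for illustration, and judgments are assumed to be binary:

```python
# Toy gold-standard test collection with binary relevance judgments.
documents = {"d1", "d2", "d3", "d4", "d5"}

queries = {
    "q1": "harmonic mean of precision and recall",
    "q2": "ranking relevant documents first",
}

# For each query, the set of documents a human judged relevant.
relevant = {
    "q1": {"d1", "d3"},
    "q2": {"d2", "d4", "d5"},
}
```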

Evaluating Precision and Recall

We define (see [see page 6, here] for the meaning of variables):

  • Recall: \(\frac{A}{A+C}\), the proportion of relevant documents that are returned. I got A relevant documents, but C relevant ones are missing.
  • Precision: \(\frac{A}{A+B}\), the proportion of retrieved documents that are relevant. I got A relevant documents and B irrelevant ones.
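
A sketch of how these would be computed in practice, assuming the retrieved and relevant documents are represented as sets of ids (the function name is illustrative, not from any particular library):

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Compute precision and recall for one query.

    A = relevant documents that were retrieved
    B = retrieved documents that are not relevant
    C = relevant documents that were missed
    """
    a = len(retrieved & relevant)   # A: relevant and retrieved
    b = len(retrieved - relevant)   # B: retrieved but not relevant
    c = len(relevant - retrieved)   # C: relevant but not retrieved

    precision = a / (a + b) if retrieved else 0.0
    recall = a / (a + c) if relevant else 0.0
    return precision, recall


# Example with the toy collection above for query "q1":
p, r = precision_recall(retrieved={"d1", "d2"}, relevant={"d1", "d3"})
print(p, r)  # 0.5 0.5
```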

Precision is often inversely related to recall, so we often use an \(F\)-score (the [see page 18, harmonic mean] of the two values) to get a single evaluation metric for an IR system. This metric gives equal importance to precision and recall; if one is more important than the other, you can use the weighted variant \(F_{\beta}\).
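
For reference, the standard definitions, with \(P\) for precision and \(R\) for recall:

\[
F_1 = \frac{2PR}{P + R},
\qquad
F_{\beta} = \frac{(1 + \beta^2)\,P R}{\beta^2 P + R}
\]

where \(\beta > 1\) weights recall more heavily and \(\beta < 1\) weights precision more heavily.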

Evaluating Order of Relevance

Measures how well a method ranks relevant documents before non-relevant documents.

Generally we only inspect the first few pages of search results, so even if a query has low precision or recall, the system is good as long as enough of the relevant documents are shown first.

To measure this we often consider the precision at some [see page 21, rank], \(n\), in the ordered result set for an IR system. Essentially the percentage of relevant documents in the first \(n\) returned documents.
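
A minimal sketch of precision at rank \(n\), assuming the system returns a ranked list of document ids (the helper name is illustrative):

```python
def precision_at_n(ranked: list[str], relevant: set[str], n: int) -> float:
    """Fraction of the first n returned documents that are relevant."""
    top_n = ranked[:n]
    if not top_n:
        return 0.0
    return sum(1 for doc in top_n if doc in relevant) / len(top_n)


# Example: 3 of the first 5 results are relevant -> P@5 = 0.6
print(precision_at_n(["d2", "d9", "d4", "d5", "d7"], {"d2", "d4", "d5"}, n=5))  # 0.6
```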