Brain Dump

Information Retrieval Evaluation

Tags
text-processing

The effectiveness of an information retrieval system is evaluated by the relevance of the documents it retrieves. Relevance is generally judged in a binary way (a document either is or isn't relevant) because grading countless degrees of relevance isn't practical.

Other possible factors are:

  • user effort / ease of use (complicated query bad)
  • response time
  • form of presentation

We test relevance in practical terms using a [see page 5, gold standard] data set with:

  • a standard set of documents and queries
  • a list of documents judged relevant for each query (by humans)
  • relevance scores (often binary)

Evaluating Precision and Recall

We define (see [see page 6, here] for the meaning of the variables):

  • Recall: $\frac{A}{A+C}$, the proportion of relevant documents returned. I got $A$ relevant documents but $C$ of them are missing.
  • Precision: $\frac{A}{A+B}$, the proportion of retrieved documents that are relevant. I got $A$ relevant documents and $B$ irrelevant ones.
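
A minimal sketch of these two formulas in Python, assuming binary relevance judgments; the function and the retrieved/relevant document-id sets are illustrative, not from the notes:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query with binary relevance.

    retrieved: set of document ids returned by the IR system
    relevant:  set of document ids judged relevant in the gold standard
    """
    a = len(retrieved & relevant)   # relevant documents that were retrieved
    b = len(retrieved - relevant)   # irrelevant documents that were retrieved
    c = len(relevant - retrieved)   # relevant documents that were missed
    precision = a / (a + b) if retrieved else 0.0
    recall = a / (a + c) if relevant else 0.0
    return precision, recall

# Example with made-up document ids:
# precision_recall({1, 2, 3, 4}, {2, 4, 5}) -> (0.5, 0.666...)
```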

Precision is often inversely related to recall, so we often use an $F$-score (the [see page 18, harmonic mean] of the two values) to get a single concrete evaluation for an IR system. This metric gives equal importance to precision and recall; if one of these is more important than the other, you can use the weighted metric $F_\beta$.
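
Concretely, writing $P$ for precision and $R$ for recall, the standard forms (consistent with the fractions above) are:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}, \qquad F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

Choosing $\beta > 1$ weights recall more heavily, while $\beta < 1$ weights precision more heavily.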

Evaluating Order of Relevance

Measures how well a method ranks relevant documents before non-relevant documents.

Generally we only inspect the first few pages of a search result, so even if a query has low precision or recall, the system is good as long as enough of the relevant documents are shown first.

To measure this we often consider the precision at some [see page 21, rank], $n$, in the ordered result set for an IR system: essentially the percentage of relevant documents among the first $n$ returned documents.
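
A rough sketch of precision at rank $n$ in Python, assuming the result list is already ordered by the system's ranking and the gold-standard relevant set is known (names and example ids are made up):

```python
def precision_at_n(ranked_results, relevant, n):
    """Precision@n: fraction of the first n retrieved documents that are relevant.

    ranked_results: list of document ids, best-ranked first
    relevant:       set of document ids judged relevant in the gold standard
    """
    top_n = ranked_results[:n]
    hits = sum(1 for doc in top_n if doc in relevant)
    return hits / n if n > 0 else 0.0

# Example: 3 of the top 5 results are relevant.
# precision_at_n([7, 2, 9, 4, 1, 8], {2, 4, 1, 30}, 5) -> 0.6
```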