Information Retrieval Evaluation
- Tags
- text-processing
The main measure of effectiveness for an information retrieval system is the relevance of the documents it retrieves. Relevance is generally judged in a binary way (a document either is or isn't relevant) because grading the countless possible degrees of relevance isn't practical.
Other possible factors are:
- user effort / ease of use (complicated queries are bad)
- response time
- form of presentation
We test relevance in practical terms using a [see page 5, gold standard] data set with:
- a standard set of documents and queries
- a list of documents judged relevant for each query (by humans)
- relevance scores (often binary)
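As a minimal sketch, such a gold standard could be represented as a mapping from query IDs to the set of document IDs judged relevant; the IDs and queries below are made up for illustration:

```python
# Hypothetical gold standard: each query ID maps to the set of
# document IDs that human judges marked as relevant (binary judgements).
gold_standard = {
    "q1": {"d3", "d7", "d12"},
    "q2": {"d1", "d4"},
}

# The standard set of queries (and documents) would be stored alongside,
# keyed by the same IDs.
queries = {"q1": "harmonic mean", "q2": "inverted index"}
```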
Evaluating Precision and Recall
We define (see [see page 6, here] for the meaning of the variables):
Term | Meaning |
---|---|
Recall | \(\frac{A}{A+C}\), the proportion of relevant documents that are returned. I got A relevant documents but missed C of them. |
Precision | \(\frac{A}{A+B}\), the proportion of retrieved documents that are relevant. I got A relevant documents and B irrelevant ones. |
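A minimal sketch of computing precision and recall from the sets of retrieved and relevant documents (the function and variable names are illustrative, not from the notes):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Binary-relevance precision and recall.

    A = relevant documents that were retrieved
    B = retrieved documents that are not relevant
    C = relevant documents that were not retrieved
    """
    a = len(retrieved & relevant)   # A: relevant and retrieved
    b = len(retrieved - relevant)   # B: retrieved but not relevant
    c = len(relevant - retrieved)   # C: relevant but missed
    precision = a / (a + b) if retrieved else 0.0
    recall = a / (a + c) if relevant else 0.0
    return precision, recall

# Example: 2 of the 3 retrieved documents are relevant (precision = 2/3),
# and 2 of the 4 relevant documents were found (recall = 1/2).
print(precision_recall({"d1", "d2", "d3"}, {"d1", "d3", "d5", "d9"}))
```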
Precision is often inversely related to recall, so we often use an \(F\)-score (the [see page 18, harmonic mean] of the two values) to get a single concrete evaluation for an IR system. This metric gives equal importance to precision and recall; if one is more important than the other, you can use the weighted version, \(F_{\beta}\).
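A sketch of the weighted \(F_{\beta}\) score using the standard formula \(F_{\beta} = (1+\beta^{2})\frac{PR}{\beta^{2}P + R}\); with \(\beta = 1\) this reduces to the plain harmonic mean of precision and recall:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    beta = 1 gives the standard F1 score.
    """
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(2 / 3, 1 / 2))            # F1 = 4/7 ≈ 0.571
print(f_beta(2 / 3, 1 / 2, beta=2.0))  # favours recall, ≈ 0.526
```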
Evaluating Order of Relevance
Measures how well a method ranks relevant documents before non-relevant documents.
Generally we only inspect the first few pages of search results, so even if a query has low precision or recall overall, the system is good as long as enough of the relevant documents are shown first.
To measure this we often consider the precision at some [see page 21, rank] \(n\) in the ordered result set for an IR system: essentially the percentage of relevant documents among the first \(n\) returned documents.
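A sketch of precision at rank \(n\), assuming the IR system returns a ranked list of document IDs (the names and data are illustrative):

```python
def precision_at_n(ranked_results: list, relevant: set, n: int) -> float:
    """Fraction of the first n returned documents that are relevant."""
    top_n = ranked_results[:n]
    if not top_n:
        return 0.0
    return sum(1 for doc in top_n if doc in relevant) / len(top_n)

# The relevant documents appear early, so P@3 is high even though
# one relevant document ("d12") is never returned at all.
ranking = ["d3", "d7", "d1", "d9", "d2"]
print(precision_at_n(ranking, relevant={"d3", "d7", "d12"}, n=3))  # 2/3 ≈ 0.67
```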