Information Retrieval
- Tags
- text-processing
The task: given a large static document collection and an information need (a query), find all documents relevant to the query. This is one application of text processing. A common example of an IR system is a web search engine.
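
The task can be sketched in a few lines, assuming (as a simplification, not something these notes state) that a document counts as relevant when it contains every query term. The names `build_index` and `search` and the three toy documents are made up for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased, whitespace-separated term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Toy static collection (hypothetical data).
docs = {
    1: "the jaguar is a big cat",
    2: "jaguar unveils a new car",
    3: "the cat sat on the mat",
}
index = build_index(docs)
hits = search(index, "jaguar car")  # only document 2 contains both terms
```

Because the collection is static, the index is built once and reused for every query; real systems use far richer notions of relevance than exact term matching.
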
IR results can be presented as:
- unsorted list
- ranked list
- clusters (results grouped by shared themes; for the query "jaguar", one cluster can be car-related and another animal-related).
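
As a rough illustration of the difference between an unsorted and a ranked list, the sketch below scores each document by how often the query terms occur in it and sorts by that score. The scoring rule, the function name `rank` and the sample documents are assumptions chosen for brevity, not a method prescribed by these notes.

```python
from collections import Counter

def rank(docs, query):
    """Return (doc_id, score) pairs, best first; score = total occurrences of query terms."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:  # documents matching no term are left out
            scored.append((doc_id, score))
    # Sort by descending score, then by id so that ties are ordered deterministically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical documents.
docs = {
    "a": "jaguar speed jaguar habitat",
    "b": "jaguar car review",
    "c": "tennis results",
}
ranked = rank(docs, "jaguar habitat")
```

Here document `a` scores 3 and `b` scores 1, so `a` is listed first; an unsorted presentation would return the same two documents without that ordering.
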
We define:
Term | Description |
---|---|
Tokenisation | Extract words from the source text, for example by splitting words at punctuation (word-based => word based). |
Capitalisation | Normalise all words to the same case, so that words with the same spelling but different cases are treated as the same word. |
Lemmatisation | Conflate inflected forms to their basic (or dictionary) form (e.g. have, has, had => have). |
Stemming | A form of lemmatisation that chops suffixes from words to conflate morphological variants. |
Normalisation | A form of lemmatisation that uses heuristics to conflate variants due to spelling, hyphenation or spacing (e.g. U.S.A => USA). |
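
The steps in the table can be chained into a small preprocessing pipeline. This is a minimal sketch under loud assumptions: `LEMMAS` is a tiny hand-written lookup standing in for a real dictionary, `SUFFIXES` is a toy suffix list rather than an established stemmer such as Porter's, and all function names are hypothetical.

```python
import re

# Hypothetical hand-written lookup; a real lemmatiser relies on a full dictionary.
LEMMAS = {"has": "have", "had": "have"}
# Toy suffix list for illustration only (not the Porter stemmer).
SUFFIXES = ("ing", "ed", "s")

def normalise(text):
    """Remove the dots from dotted acronyms, e.g. U.S.A => USA."""
    return re.sub(
        r"\b(?:[A-Za-z]\.){2,}[A-Za-z]?",
        lambda m: m.group().replace(".", ""),
        text,
    )

def tokenise(text):
    """Split on anything that is not a letter or digit, e.g. word-based => word, based."""
    return re.findall(r"[A-Za-z0-9]+", text)

def stem(token):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Normalise, tokenise, fold case, then lemmatise (falling back to stemming)."""
    tokens = [t.lower() for t in tokenise(normalise(text))]
    return [LEMMAS.get(t, stem(t)) for t in tokens]

terms = preprocess("She has had cats in the U.S.A")
```

Note that stemming can produce strings that are not dictionary words (this toy stemmer turns "based" into "bas"), which is the main practical difference from dictionary-based lemmatisation. The same pipeline should be applied to both documents and queries so that their terms match.
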
Advantages:
- Can search a huge collection rapidly
- Insensitive to the genre and domain of the texts
- Relatively straightforward to implement
Disadvantages:
- Documents are returned rather than information/answers
- Manual inspection of returned documents is required to judge relevance
- Output is unstructured, so further processing is limited
- Too permissive: a query can return thousands of results, and many beyond the first two pages are irrelevant