Information Retrieval

Tags: text-processing

The task: given a large static document collection and an information need (query), find all documents relevant to the query. This is one application of text processing. A common example of an IR system is a web search engine.
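The task can be sketched in a few lines of Python. This is only a toy illustration: the documents are made up, and "relevant" is approximated as "contains every query word", which is a strong simplifying assumption rather than how a real search engine judges relevance. The names `build_index` and `retrieve` are hypothetical.

```python
# Hypothetical toy collection; the texts and ids are invented for illustration.
DOCS = {
    1: "the jaguar is a large cat native to the americas",
    2: "jaguar unveiled a new electric car this year",
    3: "web search engines index billions of pages",
}

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def retrieve(index, query):
    """Return the ids of documents containing every query word (unsorted)."""
    words = query.split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        # Assumption: a document is "relevant" only if it has all query words.
        result &= index.get(word, set())
    return result

INDEX = build_index(DOCS)
print(retrieve(INDEX, "jaguar"))      # documents 1 and 2
print(retrieve(INDEX, "jaguar car"))  # document 2 only
```

Because the collection is static, the index is built once and reused for every query, which is part of why searching can be fast.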

IR results can be presented as:

unsorted list
ranked list
clusters (results grouped by shared theme; for the query "jaguar", one cluster can be car-related and another animal-related).
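As a rough illustration of a ranked list, the sketch below scores each document by how often the query words occur in it and sorts by that score. Counting raw occurrences is an assumption chosen for simplicity, not a claim about how real systems rank; the `rank` function and the sample documents are hypothetical.

```python
def rank(docs, query):
    """Return (doc_id, score) pairs, best first; documents scoring zero are dropped."""
    query_words = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        # Score = total number of occurrences of the query words in the document.
        score = sum(words.count(w) for w in query_words)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Invented sample documents.
SAMPLE = {
    "a": "jaguar car review the jaguar car handles well",
    "b": "the jaguar is a big cat",
    "c": "cooking with garlic",
}

for doc_id, score in rank(SAMPLE, "jaguar car"):
    print(doc_id, score)
```

An unsorted list would return the same matching documents without the final sort.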

We define:

Term	Description
Tokenisation	Extract words from the source text, for example by separating words from punctuation (word-based => word based).
Capitalisation	Normalise all words to the same case, so that words with the same spelling but different case are treated as the same word.
Lemmatisation	Conflate inflected forms to their base (dictionary) form (e.g. have, has, had => have).
Stemming	Lemmatisation by chopping suffixes off words to conflate morphological variants.
Normalisation	Lemmatisation using heuristics to conflate variants due to spelling, hyphenation, or spacing (e.g. U.S.A => USA).
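The steps in the table can be chained into a small preprocessing pipeline. The sketch below is a deliberately crude illustration under stated assumptions: the lemma lookup is a tiny hand-written dictionary, the suffix list is invented for the example (it is not the Porter stemmer or any standard algorithm), and all function names are hypothetical.

```python
import re

# Assumption: a tiny hand-written lookup standing in for a real dictionary.
LEMMAS = {"has": "have", "had": "have"}
# Assumption: a crude suffix list for illustration only, not a real stemmer.
SUFFIXES = ("ing", "ed", "s")

def normalise(text):
    """Drop the dots inside dotted abbreviations, e.g. U.S.A => USA."""
    return re.sub(
        r"\b(?:[A-Za-z]\.){2,}[A-Za-z]?",
        lambda m: m.group().replace(".", ""),
        text,
    )

def tokenise(text):
    """Split on anything that is not a letter or digit (word-based => word, based)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Normalise, tokenise, lowercase, then lemmatise (lookup) or stem each token."""
    tokens = [t.lower() for t in tokenise(normalise(text))]
    return [LEMMAS.get(t, stem(t)) for t in tokens]

print(preprocess("U.S.A had word-based indexing"))
# ['usa', 'have', 'word', 'bas', 'index']
```

Note that the stem "bas" is not a real word. Stems only need to be consistent, so that variants such as "based" and "bases" conflate to the same string. The same preprocessing would normally be applied to both the documents and the query.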

Advantages vs Disadvantages

Advantages:

Can search a huge collection rapidly
Insensitive to the genre and domain of the texts
Relatively straightforward to implement

Disadvantages:

Documents are returned rather than information/answers:
- manual inspection of returned documents is required to judge relevance
- output is unstructured, so further processing is limited
Too permissive: a query can return thousands of results, and many beyond the first two pages are irrelevant.
