Brain Dump

Information Retrieval

Tags
text-processing

The [see page 8, task] is: given a large static document collection and an information need (query), find all documents relevant to the query. This is one application of text processing. A common example of an IR system is a web search engine.
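A minimal sketch of this task, assuming a toy in-memory collection and a deliberately simple notion of relevance (a document is "relevant" if it contains every query word). The collection contents and the names `docs`, `build_index`, and `search` are hypothetical and only serve as an illustration; real systems use richer relevance models.

```python
# Toy static collection: document id -> text (hypothetical data).
docs = {
    1: "the jaguar is a big cat found in the americas",
    2: "the jaguar car company makes luxury vehicles",
    3: "search engines index web pages",
}


def build_index(collection):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in collection.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query word (unsorted)."""
    words = query.split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


index = build_index(docs)
print(sorted(search(index, "jaguar")))      # [1, 2]
print(sorted(search(index, "jaguar car")))  # [2]
```

Building the index once and answering each query with set intersections is what lets even this naive version avoid rescanning the whole collection per query.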

IR results can be presented as:

  • unsorted list
  • ranked list
  • clusters (results grouped by shared theme; for the query "jaguar", one cluster could be car related and another animal related).
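The difference between the first two presentation styles can be sketched as follows. This is an illustration only: the scoring function (a raw count of query-word occurrences) is an assumption chosen for brevity, not a recommended ranking model, and the data and function names are hypothetical. Clustering is left out because it needs a similarity measure these notes do not define.

```python
from collections import Counter

# Toy collection (hypothetical data).
docs = {
    "d1": "the car market",
    "d2": "jaguar car review",
    "d3": "jaguar jaguar habitat and jaguar diet",
}


def score(text, query):
    """Count how many times the query words occur in the text."""
    counts = Counter(text.split())
    return sum(counts[word] for word in query.split())


def unsorted_results(collection, query):
    """All matching documents, in collection order (no ranking)."""
    return [d for d, text in collection.items() if score(text, query) > 0]


def ranked_results(collection, query):
    """Matching documents ordered from highest to lowest score."""
    hits = unsorted_results(collection, query)
    return sorted(hits, key=lambda d: score(collection[d], query), reverse=True)


print(unsorted_results(docs, "jaguar car"))  # ['d1', 'd2', 'd3']
print(ranked_results(docs, "jaguar car"))    # ['d3', 'd2', 'd1']
```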

We define:

Term            Description
Tokenisation    Extract words from the source, for example by splitting words from punctuation (word-based => word based).
Capitalisation  Normalise all words to the same case, so that words that differ only in case are treated as the same term.
Lemmatisation   Conflate inflected forms to their basic (or dictionary) form (e.g. have, has, had => have).
Stemming        Like lemmatisation, but chops [see page 4, suffixes] off words to conflate morphological variants.
Normalisation   Like lemmatisation, but uses heuristics to conflate variants due to spelling, hyphenation, or spacing (e.g. U.S.A => USA).
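The terms above can be chained into a small preprocessing pipeline. The sketch below is hedged throughout: the suffix list, the lemma table, and the acronym pattern are made-up illustrations rather than a real stemmer or lemmatiser, and all function names are hypothetical.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")       # crude stemmer suffix list
LEMMAS = {"has": "have", "had": "have"}   # dictionary-style lemma table
ACRONYM = re.compile(r"\b(?:[A-Za-z]\.)+[A-Za-z]\b\.?")


def normalise(text):
    """Heuristically drop the dots in dotted acronyms (U.S.A => USA)."""
    return ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)


def tokenise(text):
    """Split words from punctuation (word-based => word, based)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Normalise, tokenise, lower-case, then lemmatise or stem each token."""
    terms = []
    for token in tokenise(normalise(text)):
        token = token.lower()  # capitalisation step
        terms.append(LEMMAS[token] if token in LEMMAS else stem(token))
    return terms


print(preprocess("The U.S.A has word-based indexing"))
# ['the', 'usa', 'have', 'word', 'bas', 'index']
```

Note that "based" becomes "bas": blind suffix chopping conflates variants cheaply but can produce non-words, which is the usual trade-off against dictionary-based lemmatisation.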

[see page 17, Advantages vs Disadvantages]

Advantages:

  • Can search a huge collection rapidly
  • Insensitive to genre and domain of texts
  • Relatively straightforward to implement

Disadvantages:

  • Documents are returned rather than information/answers:
    • manual inspection of returned documents is required to judge relevance
    • output is unstructured, so further processing is limited
  • Too permissive: a query can return thousands of results, and many beyond the first two pages are irrelevant.