Brain Dump

Indexing

Tags
text-processing

Information retrieval (IR) commonly requires indexing:

Accurately finding terms, drawn from the document itself, that can be associated with that document.

Various indexing schemes exist for various purposes, such as:

  • Dewey Decimal System
  • [see page 13, ACM] - Subfields of CS
  • [see page 14, MeSH] - Medical Subject Headings

Manual Indexing

Manual indexing has users (librarians, researchers, etc.) index documents by hand so that search engines can search them more effectively. For example, an indexer might record that a document belongs to the category literature and contains a given set of relevant terms.

Automatic Indexing

Automatic indexing uses the natural language of the text to index documents without human input (the words in a document give information about its content).

For [see page 22, example], we assign each document an id and create a lookup table whose tokens map to a collection of relevant document ids. Suppose a document foo has the id 1 and the keywords bar and baz. We create records for the tokens bar and baz and have each point to 1, the id of foo. Now we can quickly tell that document foo is relevant to the keywords bar and baz. This structure is called an inverted file (or inverted index), and each token can store as many document links as needed.
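The foo/bar/baz example can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function name build_inverted_index and the second document (id 2, with the keyword qux) are assumptions added for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for token in keywords:
            index[token].add(doc_id)
    return index


# Document foo has the id 1 and the keywords bar and baz.
# Document 2 is hypothetical, added to show a token shared by two documents.
documents = {1: ["bar", "baz"], 2: ["baz", "qux"]}
index = build_inverted_index(documents)

# Looking up a token gives the ids of the relevant documents.
docs_for_bar = index.get("bar", set())
docs_for_baz = index.get("baz", set())
```

Using a set per token keeps each document id at most once per token, and a lookup costs one dictionary access no matter how many documents were indexed.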

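There are variations of the above approach that also record how many times a token occurs in a document, or where in the document it occurs. This extra information can be used to better discover relevant documents, for example by favoring documents in which a token occurs more often.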
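One way to picture these variations is to store, for each token and document, the list of positions where the token occurs. The occurrence count is then just the length of that list. The sketch below assumes simple whitespace tokenization and lowercasing; the names build_positional_index and term_frequency and the sample texts are hypothetical.

```python
from collections import defaultdict


def build_positional_index(texts):
    """Map token -> {doc_id: [positions where the token occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in texts.items():
        # Assumption: tokens are lowercased, whitespace-separated words.
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index


def term_frequency(index, token, doc_id):
    """Number of times the token occurs in the given document."""
    return len(index.get(token, {}).get(doc_id, []))


# Hypothetical sample documents.
texts = {1: "bar baz bar", 2: "baz qux"}
positional = build_positional_index(texts)

bar_positions = positional["bar"]              # where bar occurs, per document
bar_count = term_frequency(positional, "bar", 1)
```

Keeping positions rather than bare counts costs more space, but it supports both kinds of query: how often a token occurs and where.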