Indexing
- Tags
- text-processing
Information retrieval (IR) commonly requires indexing: accurately finding terms,
taken from the document itself, that can be associated with that document.
Various indexing schemes exist for various purposes, such as:
- Dewey Decimal System
- [see page 13, ACM] - Subfields of CS
- [see page 14, MeSH] - Medical Subject Headings
Manual Indexing
Users (librarians, researchers, etc.) manually index documents so that search
engines can search them better. For example, an indexer records that a document
is in the category literature and that it contains certain relevant terms.
Automatic Indexing
Automatic indexing uses the natural language of the text to index documents
automatically (the words in a document give information about its content).
For [see page 22, example], we assign each document an id and create a lookup table of
tokens, each of which maps to a collection of relevant document ids. For example, we can
have a document foo with the id 1 and the keywords bar, baz. We create records for the
tokens bar and baz and have them point to the id of foo, 1. Now we can quickly tell that
document foo is relevant to the keywords bar and baz. This construct is called an
inverted file (or inverted index), and we can store as many document links as needed for
each token.
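The lookup table described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `build_inverted_index`, the second document (id 2) and its keyword qux are assumptions added for the example, and keywords are assumed to be already extracted.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for token in keywords:
            index[token].add(doc_id)
    return index


# Document foo has id 1 and the keywords bar, baz.
# Document 2 is hypothetical, added to show a token pointing at several ids.
documents = {1: ["bar", "baz"], 2: ["baz", "qux"]}
index = build_inverted_index(documents)

# Look up which documents are relevant to a keyword.
print(sorted(index["bar"]))
print(sorted(index["baz"]))
```

Using a set per token keeps each document id at most once, which matches the "collection of relevant document ids" idea; a sorted list would be a reasonable alternative if the postings need to be merged in order.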
There are variations of the above approach in which we also count the number of occurrences of a token in each document, or record the positions where it occurs. This extra information can be used to better discover relevant documents.
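One possible way to capture both variations is to store, for each token, the positions at which it occurs in each document; the occurrence count is then the length of that position list. The sketch below assumes whitespace tokenization and lowercasing, and the names `build_positional_index` and `term_frequency` as well as the sample texts are hypothetical.

```python
from collections import defaultdict


def build_positional_index(texts):
    """Map token -> {doc_id: [positions where the token occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in texts.items():
        # Assumption: tokens are lowercased words separated by whitespace.
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index


def term_frequency(index, token, doc_id):
    """Number of occurrences of token in the given document (0 if absent)."""
    return len(index.get(token, {}).get(doc_id, []))


# Hypothetical document texts keyed by document id.
texts = {1: "bar baz bar", 2: "baz qux"}
positional = build_positional_index(texts)

print(positional["bar"])
print(term_frequency(positional, "bar", 1))
```

A search engine could, for instance, rank documents higher when a query token occurs more often, or use the positions to check whether two query tokens appear next to each other.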