Brain Dump

Stoplist

Tags
text-processing

A form of term manipulation. A stoplist is a list of words that give little indication of a document's contents (e.g. "the"). These are often the most frequent words in the document. We don't maintain index entries for these words because they aren't useful for retrieval.

Here's a list of common stoplist words:

a, about, above, across, always, am, among, amongst, both, being, co, could.
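
As a rough illustration of skipping stoplist words at indexing time, here is a minimal sketch. The names `STOPLIST` and `build_index` are made up for this note, the stoplist is only a small sample of the words above plus "the", and tokenising by whitespace is a simplifying assumption.

```python
from collections import defaultdict

# Small sample stoplist (assumption: a real one would be much longer).
STOPLIST = {"a", "about", "above", "across", "always", "am",
            "among", "both", "being", "co", "could", "the"}


def build_index(docs):
    """Build an inverted index (term -> set of doc ids), skipping stop words.

    `docs` maps a document id to its text. Tokenisation is a plain
    lowercase whitespace split, which is a simplification.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOPLIST:
                continue  # no index entry is kept for stop words
            index[token].add(doc_id)
    return index


docs = {1: "a cat sat above the mat", 2: "the dog could run"}
index = build_index(docs)
```

With this index, a lookup for "cat" finds document 1, while "the" and "a" have no entry at all.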

Too Strong

A common issue with using a stoplist is how to search for phrases like "to be or not to be", which can consist entirely of stop words. A common approach is to allow multi-word terms; however, this can lead to [see page 7, huge index] sizes with quite a lot of redundancy.

Another approach is to identify multi-word phrases during retrieval, for example by storing the position of each term in a document and then detecting when the terms of the query phrase appear close to each other.
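
The positional approach can be sketched roughly as follows. This is only an illustration under some assumptions: the function names are invented for this note, every term (stop words included) is indexed with its positions, and "close to each other" is narrowed to the strictest case, where the query terms appear consecutively.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc id -> list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the phrase terms consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        # Try each occurrence of the first term as the phrase start.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "to be or not to be that is the question",
    2: "not to be confused with be or not",
}
pos_index = build_positional_index(docs)
```

Here `phrase_search(pos_index, "to be or not to be")` matches only document 1, even though document 2 contains all of the same words. Relaxing the check to allow a small gap between positions would give a looser "near each other" match.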