Brain Dump

Vector Space Model

Tags
information-retrieval

An information retrieval model in which each document is represented as a point in a [see page 61, high-dimensional vector space], with each distinct term in the collection corresponding to a single dimension (axis). The value of a document along each axis can be determined by the frequency of that term in the document (see [see page 67, here]).
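
As a minimal sketch of this representation, the snippet below builds raw term-count vectors for two made-up documents. The helper names (build_vocabulary, term_frequency_vector), the sample sentences, and the whitespace tokenization are assumptions for illustration, not part of the original notes.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of every distinct term across all documents."""
    return sorted({term for doc in documents for term in doc.lower().split()})

def term_frequency_vector(text, vocabulary):
    """Map a text to raw term counts, one entry per vocabulary term (axis)."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]

# Hypothetical two-document collection.
documents = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = build_vocabulary(documents)
vectors = [term_frequency_vector(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    print(doc, "->", vector)
```

Each position in a vector corresponds to one vocabulary term, so every document (and later the query) lives in the same space and can be compared coordinate by coordinate.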

We create a vector for the query as well and then, using some measure of closeness, return only the n documents closest to the query. The measure of closeness can vary depending on the domain.
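
The retrieval step can be sketched as a ranking function that takes the closeness measure as a parameter. The name closest_documents, the example vectors, and the choice of Euclidean distance as the plugged-in measure are assumptions made for this illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_documents(query_vector, document_vectors, distance, n):
    """Return the indices of the n documents nearest to the query."""
    ranked = sorted(
        range(len(document_vectors)),
        key=lambda i: distance(query_vector, document_vectors[i]),
    )
    return ranked[:n]

# Hypothetical term-count vectors over a three-term vocabulary.
document_vectors = [[1, 0, 2], [0, 3, 0], [1, 1, 1]]
query_vector = [1, 0, 2]

top_two = closest_documents(query_vector, document_vectors, euclidean_distance, n=2)
print(top_two)
```

Passing the distance function in as an argument keeps the ranking code unchanged when a different closeness measure suits the domain better.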

A common closeness measure is the [see page 68, Euclidean distance] between the vectors, where a shorter distance means a closer match. However, this approach is affected by the [see page 69, frequency of terms] in the document: a long document with large term counts lies far from a short query even when it is otherwise a good match, which skews the results.
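
A small sketch of the skew, using invented term counts: a document that repeats the query's terms ten times ends up farther from the query than a document sharing no terms at all. The vectors and variable names here are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term counts over a three-term vocabulary.
query = [1, 1, 0]
long_relevant = [10, 10, 0]   # same terms as the query, repeated ten times
short_unrelated = [0, 0, 1]   # shares no terms with the query

distance_to_relevant = euclidean_distance(query, long_relevant)
distance_to_unrelated = euclidean_distance(query, short_unrelated)

print(distance_to_relevant, distance_to_unrelated)
```

Under Euclidean distance the unrelated document ranks first here, purely because its term counts are small.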

A better approach for similarity is the [see page 74, cosine] of the angle between the vectors. This approach computes how well the two vectors correlate (their dot product) and then divides by the product of the two vectors' lengths to adjust for magnitude.
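
A minimal sketch of the cosine measure follows. The function name cosine_similarity, the example vectors, and the decision to return 0.0 for an all-zero vector are assumptions for illustration; the notes do not say how an empty document should be handled.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty vector is treated as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical term counts over a three-term vocabulary.
query = [1, 1, 0]
long_relevant = [10, 10, 0]   # same direction as the query, larger magnitude
short_unrelated = [0, 0, 1]   # shares no terms with the query

similarity_to_relevant = cosine_similarity(query, long_relevant)
similarity_to_unrelated = cosine_similarity(query, short_unrelated)

print(similarity_to_relevant, similarity_to_unrelated)
```

Because the measure depends only on direction, the long document that repeats the query's terms scores as a near-perfect match, while the document with no shared terms scores zero.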

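Note: our cosine measure will never be negative because the frequency of words ranges from 0 to some positive number. It's not possible for two vectors to point in opposite directions; the lowest possible value is 0, which occurs when the vectors are orthogonal, meaning the documents share no terms.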