Vector Space Model
An information retrieval model in which each document is represented as a point in a [see page 61, high-dimensional vector space], with each term being a single dimension (axis). The value of the document along each axis can be determined by the frequency of that term in the document (see [see page 67, here]).
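As a rough illustration of the representation described above, the sketch below turns two toy documents into term-frequency vectors over a shared vocabulary. The helper names (`build_vocabulary`, `term_frequency_vector`), the whitespace tokenization, and the sample sentences are assumptions made for this example, not part of the model itself.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of every distinct term across the documents."""
    return sorted({term for doc in documents for term in doc.lower().split()})


def term_frequency_vector(text, vocabulary):
    """Map a text to a vector holding one raw term count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical toy corpus; each vocabulary entry is one axis of the space.
documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)
vectors = [term_frequency_vector(doc, vocabulary) for doc in documents]
```

Each position in a vector corresponds to one term of `vocabulary`, so every document (and later the query) lives in the same space.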
We create a vector for the query as well and then, using some model of closeness, return only the n closest documents for the query. The model for closeness can vary depending on the domain.
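The retrieval step can be sketched as a ranking with a pluggable closeness function. The function name `closest_documents`, its parameters, and the sample vectors are hypothetical; `math.dist` (Python 3.8+) is used only as a default stand-in for whatever closeness model the domain calls for.

```python
import math


def closest_documents(query_vector, document_vectors, n, distance=math.dist):
    """Return the indices of the n documents nearest to the query, nearest first."""
    ranked = sorted(
        range(len(document_vectors)),
        key=lambda i: distance(query_vector, document_vectors[i]),
    )
    return ranked[:n]


# Hypothetical vectors over a three-term vocabulary.
document_vectors = [[1, 0, 2], [0, 3, 0], [1, 1, 1]]
query_vector = [1, 0, 2]
top_two = closest_documents(query_vector, document_vectors, n=2)
```

Passing a different `distance` callable swaps the closeness model without touching the ranking logic.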
A common closeness model is the shortest [see page 68, Euclidean distance] between the vectors. However, this approach can be affected by the [see page 69, frequency of terms] in the document, leading to skewed results for otherwise good matches: a long document that repeats its terms many times lies far from a short query on the same topic.
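A small sketch of how raw term frequencies can skew Euclidean distance. The three vectors are made-up examples over a three-term vocabulary: `long_on_topic` uses the same terms as the query in the same proportions, only ten times as often, while `short_off_topic` shares just one term with the query.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical term-frequency vectors.
query = [1, 1, 0]
long_on_topic = [10, 10, 0]   # same term proportions as the query, much longer
short_off_topic = [0, 1, 1]   # shares only one term with the query

distance_on_topic = euclidean_distance(query, long_on_topic)
distance_off_topic = euclidean_distance(query, short_off_topic)
```

Here the off-topic document comes out closer to the query than the on-topic one, purely because of document length.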
A better approach for similarity is the [see page 74, cosine] similarity between the vectors. This approach computes how well the two vectors correlate (their dot product) and then divides by the product of the two vectors' lengths to adjust for magnitude.
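A minimal sketch of the cosine measure, computed as the dot product divided by the product of the two vector lengths. The function name, the sample vectors, and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a vector with no terms is treated as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term-frequency vectors.
query = [1, 1, 0]
long_on_topic = [10, 10, 0]   # same term proportions as the query, much longer
short_off_topic = [0, 1, 1]   # shares only one term with the query

similarity_on_topic = cosine_similarity(query, long_on_topic)
similarity_off_topic = cosine_similarity(query, short_off_topic)
```

Because the lengths are divided out, the long on-topic document scores about 1.0 while the off-topic one scores about 0.5, so document length no longer dominates the ranking.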
Note: our cosine measure will never be negative because the frequency of words ranges from 0 to some positive number. It's not possible for two vectors to point in opposite directions; at worst they are orthogonal (cosine of 0), which happens when the two documents share no terms.
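The non-negativity can be read off the formula directly. In the sketch below, the symbols q (query vector) and d (document vector) and the index i over terms are notation assumed for this example:

```latex
\cos\theta
  = \frac{\mathbf{q}\cdot\mathbf{d}}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{d}\rVert}
  = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}},
\qquad
q_i \ge 0,\; d_i \ge 0
\;\Longrightarrow\; q_i d_i \ge 0
\;\Longrightarrow\; \cos\theta \ge 0
```

Every term in the numerator is a product of two non-negative frequencies, and the denominator is positive for non-empty vectors, so the quotient cannot drop below zero.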