Brain Dump

Term Weighting

Tags
text-processing

A form of term-manipulation where we weight given terms in a query (and/or collection) to better find relevant documents.

We [see page 10, define]:

| Name | Symbol | Description |
|---|---|---|
| Document Collection | \(D\) | Collection (set) of documents |
| Size of Collection | \(\mid D \mid\) | Total number of documents in the collection |
| Term Frequency | \(tf_{w,d}\) | Number of times word \(w\) occurs in document \(d\) |
| Collection Frequency | \(cf_w\) | Number of times \(w\) occurs in the collection |
| Document Frequency | \(df_w\) | Number of documents containing \(w\) |

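As a quick illustration of these statistics, here is a minimal Python sketch over a made-up toy collection (the documents and their contents are purely illustrative):

```python
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs and cats living together".split(),
}

size_D = len(docs)                                        # |D|
tf = {d: Counter(ws) for d, ws in docs.items()}           # tf_{w,d}
cf = Counter(w for ws in docs.values() for w in ws)       # cf_w
df = Counter(w for ws in docs.values() for w in set(ws))  # df_w

print(size_D)           # 3
print(tf["d1"]["the"])  # 2
print(cf["the"])        # 4
print(df["the"])        # 2
```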
We also [see page 11, find] that the collection frequency isn't a very good measure of the relevance of a term in a document.

For example, one word could appear 1000 times in a single document, while another word appears 100 times in each of 10 documents. Both terms have the same collection frequency (1000), but their document frequencies (1 and 10) are vastly different. For a query containing both words, the document in which the first word appears 1000 times is much more relevant than the others.

Because of this we [see page 12, define] \(\frac{\mid D \mid}{df_w}\) as the weighting value for a term. This is commonly referred to as the inverse document frequency. Since it becomes very large for small \(df_w\), we usually take the logarithm of this fraction.
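Written out in full (the standard logarithmic form, consistent with the definition above):

\[
idf_w = \log \frac{\mid D \mid}{df_w}
\]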

We [see page 13, multiply] the inverse document frequency by the term frequency for each term in the query to get a score for a document. This tf-idf weighting is the most commonly used method for term weighting in information retrieval and summarisation.
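A hedged sketch of this scoring, reusing the toy collection from above (the query terms and scores are illustrative only):

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs and cats living together".split(),
}
tf = {d: Counter(ws) for d, ws in docs.items()}
df = Counter(w for ws in docs.values() for w in set(ws))

def tfidf_score(query, doc_id):
    # Sum tf_{w,d} * log(|D| / df_w) over the query terms.
    score = 0.0
    for w in query:
        if df[w]:  # terms absent from the collection contribute nothing
            score += tf[doc_id][w] * math.log(len(docs) / df[w])
    return score

for d in docs:
    print(d, round(tfidf_score(["cat", "dog"], d), 3))
# d2 scores highest: it is the only document containing both query terms.
```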

Note: the weightings themselves don't tell us how relevant a document is to a query; we still need an appropriate IR model, such as the vector space model, which ranks documents by the weighted [see page 16, cosine distance] between the query and each document in the index.
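A minimal sketch of that vector space step, assuming the same toy collection: represent the query and each document as sparse tf-idf vectors, then rank by cosine similarity (cosine distance is simply one minus this value).

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs and cats living together".split(),
}
df = Counter(w for ws in docs.values() for w in set(ws))
N = len(docs)

def tfidf_vector(words):
    # Map each word to tf * log(|D| / df): a sparse vector as a dict.
    counts = Counter(words)
    return {w: c * math.log(N / df[w]) for w, c in counts.items() if df[w]}

def cosine_similarity(u, v):
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * v.get(w, 0.0) for w, x in u.items()) / (nu * nv)

query = tfidf_vector("cat dog".split())
for d, words in docs.items():
    print(d, round(cosine_similarity(query, tfidf_vector(words)), 3))
```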