Continuous Word Representations
- Tags
- text-processing
Continuous word representation (CWR) is the process of converting words to vectors in a way that captures semantic meaning.
word2vec offers a representation of word meaning that is much richer than term-weighting schemes (binary, tf, tf-idf).
It proposes techniques to pre-train a fixed-length vector for every word from very large corpora.
In CWR each word is itself represented by a vector, rather than the VSM approach of treating each word as an axis in an n-dimensional coordinate space.
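A minimal sketch of that contrast, using a made-up five-word vocabulary and invented dense vectors (both hypothetical, not taken from any trained model):

```python
import numpy as np

# Hypothetical five-word vocabulary, for illustration only.
vocab = ["strong", "powerful", "weak", "cat", "dog"]

# VSM-style view: each word is its own axis, i.e. a one-hot vector
# whose dimensionality equals the vocabulary size.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# CWR view: each word is a dense, fixed-length vector (values invented here);
# dimensionality is independent of vocabulary size.
dense = {
    "strong":   np.array([0.9, 0.1, 0.3]),
    "powerful": np.array([0.8, 0.2, 0.4]),
    "weak":     np.array([-0.7, 0.0, 0.2]),
    "cat":      np.array([0.1, 0.9, -0.5]),
    "dog":      np.array([0.2, 0.8, -0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors of distinct words are always orthogonal (similarity 0),
# while dense vectors of related words can be close.
print(cosine(one_hot["strong"], one_hot["powerful"]))  # 0.0
print(cosine(dense["strong"], dense["powerful"]))      # close to 1
```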
Approaches
There are two approaches for CWR, both based on training neural networks: the pre-final
layer of the NN is extracted and treated as the vector representation of the word.
This is also sometimes referred to as the word's embedding.
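As a hedged sketch, the two word2vec training modes (CBOW and skip-gram) can be run with gensim; the toy corpus and parameter values below are assumptions for illustration (gensim 4.x API):

```python
from gensim.models import Word2Vec

# Toy corpus (hypothetical); in practice word2vec is pre-trained on very large corpora.
sentences = [
    ["the", "strong", "man", "lifted", "the", "weight"],
    ["the", "powerful", "engine", "moved", "the", "train"],
    ["a", "weak", "signal", "was", "detected"],
]

# sg=1 selects skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Each word now maps to a fixed-length dense vector (its embedding).
vec = model.wv["strong"]
print(vec.shape)  # (50,)
```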
These embeddings can capture important aspects of word meaning:
- Vectors of semantically similar words are close to each other, e.g. the vector for "strong" is near the vector for "powerful".
- Analogous semantic relationships are captured and can be demonstrated using vector arithmetic, e.g. king - man + woman ≈ queen (see the sketch after this list). See [see page 62, here].
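A hedged sketch of both properties using pre-trained GloVe vectors shipped through gensim's downloader (the model name and exact outputs are assumptions; any pre-trained embedding set would do):

```python
import gensim.downloader as api

# Load a small set of pre-trained word vectors (downloads on first use).
wv = api.load("glove-wiki-gigaword-50")

# Semantically similar words have nearby vectors (high cosine similarity).
print(wv.similarity("strong", "powerful"))

# Analogies via vector arithmetic: king - man + woman ≈ queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```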
Applications in Search
How do you use CWR to search a document?
- Represent word sequences (doc, query) as the sum/average of their word vectors (see the sketch after this list).
- Represent a word sequence as a bag of embedded words.
- Train [see page 70, additional] document vectors in conjunction with word vectors, in a manner similar to word2vec.
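A minimal sketch of the first option: rank a hypothetical document collection by cosine similarity between averaged query and document vectors (the corpus, query, and GloVe model name are assumptions):

```python
import numpy as np
import gensim.downloader as api

# Pre-trained word vectors (assumed; same GloVe set as above).
wv = api.load("glove-wiki-gigaword-50")

def embed(text):
    """Represent a word sequence as the average of its word vectors
    (out-of-vocabulary words are simply skipped)."""
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Hypothetical mini document collection.
docs = [
    "powerful engines drive heavy trains",
    "the cat sat on the mat",
    "strong athletes lift heavy weights",
]

query = "muscular person lifting weights"
q = embed(query)

# Rank documents by cosine similarity between query and document vectors.
ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
print(ranked[0])
```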