Brain Dump

Continuous Word Representations

Tags
text-processing

Continuous word representation (CWR) is the process of converting words to vectors in a way that captures semantic meaning.

word2vec offers a representation of word meaning that is much richer than any term-weighting scheme (binary, tf, tf-idf).

word2vec proposes techniques to pre-train fixed-length vectors for every word from very large corpora.
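A minimal sketch of pre-training such vectors, assuming the gensim library (not mentioned in the note); the toy corpus and hyper-parameters are illustrative only:

```python
from gensim.models import Word2Vec

# Toy "corpus"; in practice this would be a very large text collection.
sentences = [
    ["the", "strong", "man", "lifted", "the", "weight"],
    ["the", "powerful", "engine", "pulled", "the", "train"],
    ["a", "weak", "signal", "was", "lost"],
]

# Pre-train a fixed-length (100-dimensional) vector for every word.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, epochs=50)

print(model.wv["strong"].shape)  # (100,) -- one fixed-length vector per word
```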

In CWR each word is itself represented by a vector, rather than the VSM approach of treating each word as an axis in an n-dimensional coordinate space.
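A rough illustration of the difference, with made-up vocabulary and numbers (numpy only):

```python
import numpy as np

# VSM view: every vocabulary term is an axis, so representations are long
# and mostly zero (one dimension per term).
vocab = ["strong", "powerful", "weak", "engine", "train"]
vsm_strong = np.zeros(len(vocab))
vsm_strong[vocab.index("strong")] = 1.0   # "strong" sits on its own axis

# CWR view: the word itself is a short dense vector, and similar words end
# up with similar vectors (values below are made up for illustration).
cwr_strong = np.array([0.8, 0.1, -0.3])
cwr_powerful = np.array([0.7, 0.2, -0.2])
```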

Approaches

There are two approaches to CWR, both based on training a neural network (in word2vec these are CBOW and skip-gram). The pre-final layer of the trained network is extracted and treated as the vector representation of the word; this is also referred to as the word's embedding.
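A from-scratch sketch of the skip-gram variant in numpy, assuming a toy vocabulary and hand-picked (center, context) pairs; it shows that the weights of the pre-final (hidden) layer are what gets kept as the embeddings:

```python
import numpy as np

vocab = ["strong", "powerful", "man", "engine", "weak"]
word2id = {w: i for i, w in enumerate(vocab)}
pairs = [("strong", "man"), ("powerful", "engine"), ("strong", "powerful")]

V, D = len(vocab), 8                        # vocabulary size, embedding size
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input -> hidden (pre-final layer)
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden -> output

lr = 0.1
for _ in range(200):
    for center, context in pairs:
        c, o = word2id[center], word2id[context]
        h = W_in[c]                          # hidden activation for the center word
        scores = h @ W_out                   # score for every vocabulary word
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                 # softmax over the vocabulary
        grad = probs.copy()
        grad[o] -= 1.0                       # cross-entropy gradient wrt scores
        dh = W_out @ grad                    # gradient wrt the hidden activation
        W_out -= lr * np.outer(h, grad)
        W_in[c] -= lr * dh

# The rows of W_in (the pre-final layer's weights) are the word embeddings.
print(W_in[word2id["strong"]])
```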

These embeddings can capture important aspects of word meaning:

  • Vectors of semantically similar words are close to each other, e.g. the vector for "strong" is near the vector for "powerful".
  • Analogous semantic relationships are captured and can be demonstrated with vector arithmetic, e.g. king - man + woman ≈ queen [see page 62, here]; a sketch follows after this list.
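A sketch of both properties using gensim's downloader and pre-trained GloVe vectors (the model name is an assumption, not from the note, and loading it needs a network download):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")     # 50-dimensional pre-trained vectors

# Semantically similar words have nearby vectors.
print(wv.most_similar("strong", topn=3))

# Analogies via vector arithmetic: king - man + woman ~= queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```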

How do you use CWR to search a document?

  • Represent word sequences (doc, query) as the sum/average of their word vectors (see the sketch after this list).
  • Represent word sequence as a bag of embedded words.
  • Train an [see page 70, additional] document vector in conjunction with the word vectors, in a manner similar to word2vec.
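A minimal sketch of the first option (average of word vectors plus cosine similarity); the documents, query, and whitespace tokenisation are illustrative, and the pre-trained vectors are the same assumption as above:

```python
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

def avg_vector(text):
    """Average of the word vectors of the in-vocabulary tokens."""
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

docs = [
    "the powerful engine pulled the train",
    "a weak radio signal was lost",
]
query = "strong motor"

q = avg_vector(query)
ranked = sorted(docs, key=lambda d: cosine(avg_vector(d), q), reverse=True)
print(ranked[0])   # expect the engine/train document to rank first
```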