You may be thinking…
How on earth does my computer understand words? If you think that there is a person hiding in your computer, then we have news for you! There is not. There is however, a super complex algorithm that very smart data scientists created using this things called word vectors.
Word Vector Primer
Although many natural language processing (NLP) algorithms provide the appearance of understanding words and phrases, these algorithms primarily work with representations of words in a number space. In NLP, we call these representations word vectors. You can think of a vector as a spelling of a word that a computer understands in the context of a particular learned language. The following blog post gives a quick primer on what word vectors are and how they are used in NLP research.
Example Word Vectors
Word vectors allow computers to handle words as though they are numeric values, making it easy to execute operations and comparisons between them. The number-space that is generated is specific to the set of training data used. If the vector shown above was used in a number-space trained with a different set of corpuses, the numeric vector may refer to a completely different word.
As in normal speech, single words don’t mean much on their own, but have meaning and context when grouped with other words. Word vectors work the same way. They have meaning in the context of other vectors. The contents of the training corpus used to generate the number space will have an impact on the associations/meanings of words.
This is intuitive and has a parallel to how we learn language. For example, if the only book (corpus) a child (algorithm) was allowed to read (process) was “Green Eggs and Ham” and the child (algorithm) never had eggs, the child (algorithm) may assume that the adjective “green” is frequently associated with the noun “egg” and may assume that eggs are green. The direction of the word vector inside of this number space contains this meaning information about the vectorized word.
Very similar words have vectors which point in similar directions. Kings and Kaisers are both male European royalty; their word vectors would point in a similar direction. In a manner similar to real words, word vectors can be strung together through vector operations and still retain meaning. Consider the ubiquitous example of “king”, “queen”, “man”, and “woman”. Suppose you wanted to refer to the royalty that is like a king, but a woman, not a man. In a properly trained word vector set, these vector operations would allow the computer to know you were referring to a queen.
Toy Example of Word Vectors in a 2D space
What does this all mean
How a computer learns sets of word vectors can have a significant influence over how effective NLP algorithms are at representing real speech. Behind the scenes of NLP calculations and associations are being made quickly to find numerical value of the words that you type. Depending on those associations your NLP engine may perform better than others. It is all about the context and the quality of the data and good thing too! You wouldn’t want your computer to order you green eggs from the store… that would not be fun at all!
Corpus: Large set of unstructured documents that serve as training material to natural language processing algorithms
Natural Language Processing: applied mathematical models that understand and generate human speech/language
Word Vector: a distributed word representation in number space to capture semantic associations in vocabulary