At Recast.AI, we use Natural Language Processing (NLP) as a way to enrich input from users, and context is an important part of the process. When we handle a user’s request, we decide whether or not it matches an intent, which is the general meaning of a sentence. In order to do that, we need to understand what the user is saying by analyzing the context and the meanings of each word.
One day, I came across a sentence our software couldn’t categorize correctly. It was a simple sentence, only 7 words:
“The workers at the plant were overworked.”
Our program at the time wasn’t capable of understanding the meaning of plant, failing again and again to recognize it as an industrial facility.
After a few searches, we found plenty of papers on the problem, and from there we began to explore one of the oldest unresolved issues in NLP.
Word Sense Disambiguation
The task of Word Sense Disambiguation (WSD) consists of selecting the best sense for an occurrence of a word in a given context. In our case, we had to find the correct sense for the word plant.
This issue was first identified in the early days of Machine Translation (in the late 1940s), and it remains unresolved today.
Researchers have tried various approaches, from rule-based to probabilistic systems, by way of fine-grained knowledge bases and automated knowledge extraction. None of these efforts has been enough to solve the problem entirely.
Many algorithms and Machine Learning techniques have been applied to WSD. Every solution has its pros and cons, but the state of the art in supervised Machine Learning achieves more than 90% accuracy.
For our first implementation, our goal was to find the best compromise between simplicity and efficiency.
Flora, Stratagem or Building?
The underlying problem is that a single word can have very different meanings. Finding and selecting the right meaning for the right context is decisive for comprehending natural language.
Here comes WordNet, a lexical database for English. It groups words into sets of synonyms, provides short definitions and usage examples, and records the relationships between those sets.
WordNet allows us to retrieve synonyms, antonyms, and different forms of a word. For each of these, we get a definition.
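A lookup like this is often done with NLTK’s WordNet interface. The sketch below mimics it with a tiny hand-made sense inventory so it can stand on its own: the sense IDs and glosses are illustrative stand-ins, not real WordNet data.

```python
# Toy WordNet-style sense inventory. The sense IDs and glosses below are
# illustrative stand-ins, not real WordNet entries.
SENSE_INVENTORY = {
    "plant": [
        ("plant.n.01", "an industrial site where workers process materials"),
        ("plant.n.02", "a living organism such as a tree or flower"),
    ],
    "worker": [
        ("worker.n.01", "a person employed at a specific occupation"),
    ],
}

def definitions(word):
    """Return every (sense_id, gloss) pair recorded for a word."""
    return SENSE_INVENTORY.get(word.lower(), [])

# Collect the candidate definitions for the content words of the sentence.
# (rstrip("s") is a crude stand-in for real lemmatization.)
for word in ["workers", "plant"]:
    for sense_id, gloss in definitions(word.rstrip("s")):
        print(sense_id, "->", gloss)
```

With the real database, the equivalent calls would go through NLTK’s `wordnet` corpus reader (`wn.synsets(word)`, `synset.definition()`), which also exposes lemmas and antonyms.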
Once we have collected all the definitions of each word in the sentence, we use the Lesk algorithm to compare their similarity.
The Lesk algorithm (Michael E. Lesk, 1986) is based on the assumption that words in a sentence will share a common topic. The idea is simple: We take all the possible definitions of each word in a sentence and select the definitions which overlap the most.
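The overlap-counting idea can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the stopword list, the two glosses for plant, and the sense IDs are made up, not real WordNet data.

```python
# Simplified Lesk: pick the sense whose gloss shares the most words
# with the rest of the sentence.

STOPWORDS = {"the", "a", "an", "at", "of", "in", "on", "were", "was"}

def tokens(text):
    """Lowercased content words of a text, minus stopwords and punctuation."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS

def simplified_lesk(word, sentence, senses):
    """Return the sense_id whose gloss overlaps the sentence context most."""
    context = tokens(sentence) - {word}
    best, best_overlap = None, -1
    for sense_id, gloss in senses:
        overlap = len(context & tokens(gloss))
        if overlap > best_overlap:
            best, best_overlap = sense_id, overlap
    return best

# Hypothetical glosses for the two readings of "plant".
PLANT_SENSES = [
    ("plant.n.01", "an industrial site where workers process materials"),
    ("plant.n.02", "a living organism such as a tree or flower"),
]

print(simplified_lesk("plant", "The workers at the plant were overworked.", PLANT_SENSES))
# → plant.n.01 (the industrial sense, thanks to the overlap on "workers")
```

NLTK ships a comparable implementation as `nltk.wsd.lesk`, which runs the same overlap comparison over real WordNet synsets.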
We can represent this process in a simple diagram. What is really interesting about it is that it shows the similarities between the Lesk algorithm and the way our brain works.
By using a knowledge base and this crafted “brain”, we can now select the best sense for the word plant in the context of our sentence, allowing us to generate its synonyms, antonyms and so on.
Word Sense Disambiguation lets us determine the sense of a word from its context. It also enables automatic translation, and even the first stages of language generation.
Today, bots and AIs have proven they’re on the way to understanding language, but the next step – and most important one – is for them to be able to answer us by themselves!