At Recast.AI, we use Natural Language Understanding (NLU) to enrich input from users, and context is an important part of the process. While handling our inputs, we have to decide whether or not a given sentence corresponds to a specific meaning (we call it an Intention). To do that, we need to understand what the user is saying by using the context and the sense of each word.
One day, we stumbled upon a sentence our software wasn’t able to categorize. It was a simple sentence, counting exactly 7 words:
“The workers at the plant were overworked.”
Our program at the time wasn’t capable of understanding the meaning of plant, failing over and over to categorize it as an industrial facility.
After a few searches, we found many papers describing and trying to solve our problem, and so began our trip into the depths of one of the oldest open problems of NLU.
Word Sense Disambiguation
The task of Word Sense Disambiguation (WSD) consists of selecting the best sense for an occurrence of a word in a given context. In our case, we had to find the correct sense for the word plant.
This issue was identified in the early days of Machine Translation (1940s). At first, people argued about whether such a problem could be solved at all, declaring that it was impossible to model all the knowledge a computer would need to understand language.
Three decades later, linguists started to create rule-based systems to try to solve WSD, relying on hand-crafted knowledge bases, such as OALD, LDOCE and Roget’s Thesaurus.
The next step (1980s) was to replace the rules with knowledge extracted automatically from those sources, thus automating part of the process.
In the 1990s, probabilistic models appeared, and Machine Learning techniques were applied to WSD, with really convincing results.
The 2000s brought knowledge from the Web to WSD. This sparked a burst of activity: researchers tried to mix Machine Learning and knowledge bases, creating Unsupervised Machine Learning methods and other hybrid algorithms.
Below, you can find a non-exhaustive list of the algorithms and Machine Learning techniques used in WSD:
Every solution has its strengths and weaknesses, but state-of-the-art Semi-Supervised Machine Learning methods achieve more than 90% accuracy.
Our goal was to get our hands dirty with a first implementation of a solution, and we wanted a good compromise between simplicity and efficiency.
So, after a few days of research and thinking, we finally decided to stick with the traditional approach of a Knowledge-based algorithm.
The Lesk algorithm (Michael E. Lesk, 1986) is based on the assumption that words in a sentence will share a common topic. The idea is simple: given a sentence, the algorithm selects the senses whose definitions have the maximum overlap (the highest number of common words).
We can represent this by a simple picture, see below:
What’s really interesting about this schema is that it shows the resemblance between the Lesk algorithm and our brain’s process. Yes, that’s right: Lesk looked at the way we analyze a sentence in real time and implemented a solution copying that operation. Moreover, we possess and use a lot of information to disambiguate a word, which comes mainly from our education.
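To make the overlap idea concrete, here is a minimal Python sketch of the simplified Lesk variant, which compares each sense’s definition with the words of the sentence. It is not our production code: the stop-word list, the function names and the two glosses are made up for the illustration (the glosses are loosely inspired by WordNet and tweaked so that they overlap with our example sentences).

```python
import re

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"a", "an", "the", "at", "of", "for", "on", "in", "by",
              "was", "were", "and", "or", "that"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def simplified_lesk(sentence, senses):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = tokenize(sentence)
    return max(senses, key=lambda name: len(context & tokenize(senses[name])))

# Toy glosses written for this example, not real WordNet entries.
PLANT_SENSES = {
    "industrial plant": "buildings where workers carry on industrial labor",
    "flora": "a living organism that bears leaves and flowers",
}

print(simplified_lesk("The workers at the plant were overworked.", PLANT_SENSES))
print(simplified_lesk("The plant was no longer bearing flowers.", PLANT_SENSES))
```

With such short glosses, a single shared word ("workers" or "flowers") decides the outcome, which hints at why a richer source of definitions matters.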
Since our solution is knowledge-based, we needed a dictionary built in a way a computer could use: we had to find an education model for our algorithm.
We decided to use Princeton University’s WordNet, created in 1995 by George A. Miller. WordNet is a lexical database for the English language: it groups English words into synsets (synonym sets), provides short definitions and usage examples, and records a number of relations among these synonym sets or their members.
After 8 releases and 20 years, WordNet contains 117,954 nouns, 21,500 adjectives, 11,541 verbs and 4,476 adverbs. That’s huge!
Here’s what the different meanings of plant look like in WordNet:
Unfortunately, the original interface is written in C, so we built a custom Ruby API that we called Omniscient.
Omniscient allows us to retrieve most of the semantic relations a word has. For example, every synset of plant will be linked to synonyms, antonyms, hyponyms (children), hypernyms (parents), meronyms (part of), holonyms (whole of), etc. Those relations are used to create a glossbag, a bag of definitions related to a word’s sense.
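As an illustration, a glossbag can be sketched in a few lines of Python. This is not Omniscient (which is written in Ruby) and does not use real WordNet data: the `SYNSETS` dictionary, its identifiers, its glosses and the `build_glossbag` helper are all hypothetical stand-ins, and we assume here that only direct relatives are collected.

```python
# Toy stand-in for WordNet: each synset has a gloss and typed relations.
# Identifiers and glosses are illustrative, not real WordNet entries.
SYNSETS = {
    "plant.industrial": {
        "gloss": "buildings for carrying on industrial labor",
        "hypernyms": ["building_complex"],
        "hyponyms": ["factory"],
    },
    "building_complex": {
        "gloss": "a whole structure made up of interconnected buildings",
    },
    "factory": {
        "gloss": "a plant with facilities for manufacturing",
    },
}

# Relations followed when collecting definitions (an assumption of this sketch).
RELATIONS = ("hypernyms", "hyponyms", "meronyms", "holonyms")

def build_glossbag(synset_id, synsets=SYNSETS, relations=RELATIONS):
    """Collect the gloss of a synset plus the glosses of its direct relatives."""
    entry = synsets[synset_id]
    bag = [entry["gloss"]]
    for relation in relations:
        for related_id in entry.get(relation, []):
            bag.append(synsets[related_id]["gloss"])
    return bag

for gloss in build_glossbag("plant.industrial"):
    print(gloss)
```

The bag for the industrial sense now also contains words such as "structure" and "manufacturing", which gives the overlap measure much more material than the bare definition.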
Once the glossbag is done for every word in the sentence, we use the Lesk algorithm to compare their similarity, and we choose the best score of all the comparisons.
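One possible reading of this step, sketched in Python under stated assumptions: every sense of the target word is compared with every sense of the context words, the score of a target sense is its best overlap, and the highest-scoring sense wins. The glossbags, the stop-word list and the function names below are hypothetical, and the plain word count is a simplification of our actual scoring, which is weighted.

```python
import re

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"a", "an", "the", "of", "for", "on", "or", "with", "who"}

def bag_words(glossbag):
    """Flatten a glossbag (list of definitions) into a set of content words."""
    words = re.findall(r"[a-z]+", " ".join(glossbag).lower())
    return {w for w in words if w not in STOP_WORDS}

def overlap(bag_a, bag_b):
    """Count the distinct content words two glossbags share."""
    return len(bag_words(bag_a) & bag_words(bag_b))

def best_sense(target_bags, context_bags):
    """Score each target sense by its best overlap with any context sense."""
    scores = {
        sense: max(overlap(bag, other) for other in context_bags.values())
        for sense, bag in target_bags.items()
    }
    return max(scores, key=scores.get), scores

# Hand-written glossbags for the example, not real WordNet output.
PLANT = {
    "industrial plant": ["buildings for carrying on industrial labor",
                         "a plant with facilities for manufacturing"],
    "flora": ["a living organism lacking the power of locomotion"],
}
WORKER = {
    "worker": ["a person who performs manual or industrial labor"],
}

sense, scores = best_sense(PLANT, WORKER)
print(sense, scores)
```

Here the industrial sense shares "industrial" and "labor" with the glossbag of worker, while the flora sense shares nothing once stop words are removed.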
At first, the results were quite disappointing, but after a few tweaks and improvements, here’s what we had:
“The workers at the plant were overworked.”
8.76 | (n) plant, works, industrial plant (buildings for carrying on industrial labor)
7.24 | (n) plant, flora, plant life (a living organism)
2.56 | (n) plant (something planted secretly for discovery by another)
0.08 | (n) plant (an actor situated in the audience)
Looks good, right?
We can try with another sense, see:
“The plant was no longer bearing flowers.”
9.26 | (n) plant, flora, plant life (a living organism)
7.77 | (n) plant, works, industrial plant (buildings for carrying on industrial labor)
3.30 | (n) plant (something planted secretly for discovery by another)
1.01 | (n) plant (an actor situated in the audience)
Word Sense Disambiguation can seem trivial, but it is the first step towards automating the understanding, or even the generation, of natural language.
In this article, we looked at the technical part of understanding the user, but is it the only thing required for an AI to be considered good?
Paul RENVOISÉ – Recast.AI
Comparing Similarity Measures for Lesk Algorithm, Torres and Gelbukh
An Adapted Lesk Algorithm for WSD Using Wordnet, Banerjee and Pedersen
WSD using Wordnet and the Lesk Algorithm, Ekedahl and Golub
Unsupervised WSD Rivaling Supervised Methods, Yarowsky