And serving it as a REST API
This post describes how to build a simple multilingual spell-checker service in Python.
Spell-checkers are a common utility in everyday software: they power search engines, text messaging applications and virtual assistants, and provide valuable support to the end user.
In this post, we leverage existing libraries, writing a thin wrapper around them, without diving into the inner workings of the spell-checking process itself.
Given an input sentence, the service first determines the language, with an associated probability. Then, it checks for misspellings in the input text, based on the detected language. For each misspelled word, a correct alternative is provided, resulting in a recommended sentence. A measure of similarity between input and recommendation is also returned:
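An illustrative response for a slightly misspelled Italian input might look like the following (the exact shape follows the objects we define later in this post):

{
  "language": {"identifier": "it", "probability": 0.99},
  "spellcheck": {"suggestion": "Questo è un piccolo esempio", "similarity": 0.96}
}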
Process and setup
1. Language detection
For language detection we use the langdetect library, a Python port of Google’s language-detection project. In particular, we use its handy detect_langs function that, given an input text, returns a list of candidate languages with their corresponding probabilities.
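As a quick illustration (the exact probability will vary slightly between runs, as discussed later):

from langdetect import detect_langs

print(detect_langs("Questo è un piccolo esempio"))
# e.g. [it:0.9999953626707937]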
2. Spell-checking
We use PyEnchant to detect misspellings and to suggest corrections. PyEnchant offers Python bindings for the Enchant spell-checking library, which provides a unified wrapper around different backends (such as Hunspell, Aspell, …).
The API is simple to use, but there are two prerequisites to be met:
- Install Enchant through the package manager, depending on the operating system and package availability (for example, Ubuntu’s available packages can be found here). In our case:
sudo apt-get install libenchant-dev
- Install the dictionaries for the needed languages. We use Hunspell dictionaries for Italian, Spanish, German, French (and English), but more languages are supported:
sudo apt-get install hunspell-it hunspell-es hunspell-de-de hunspell-fr
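Once Enchant and the dictionaries are installed, checking a word and asking for suggestions is straightforward (a minimal sketch; the suggestion list depends on the installed dictionary version):

import enchant

d = enchant.Dict("it_IT")   # load the Italian Hunspell dictionary
d.check("esempio")          # True: the word is spelled correctly
d.suggest("esempo")         # candidate corrections, e.g. ["esempio", ...]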
3. Similarity between words
When returning a suggestion, we also want to provide a measure of similarity between the recommendation and the original text: an estimate of how close a suggested sentence is to the original one can be used for filtering purposes.
For this estimate we use the SequenceMatcher object from the difflib library to compute a similarity measure (S) between each word of the recommended text (R) and the original text (T), and then we average over the number of words (tokens):
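A minimal sketch of this computation, assuming whitespace tokenization and an input and recommendation with the same number of tokens:

from difflib import SequenceMatcher

def sentence_similarity(original: str, recommended: str) -> float:
    # S = (1/N) * sum over the N word pairs of the SequenceMatcher ratios
    pairs = list(zip(original.split(), recommended.split()))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, t, r).ratio() for t, r in pairs) / len(pairs)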
In this example, we are using languages whose tokenization, i.e. the process of splitting a sentence or chunk of text into a sequence of words, can leverage the presence of spaces between words.
Other languages, such as Japanese, do not satisfy this assumption, and would require different tokenization strategies.
We begin with the definition of the objects we are going to use:
- Language: composed of an identifier (en, es, fr, …) and a probability.
- SpellCheck: composed of the suggested sentence and the similarity score.
- Response: the Language and SpellCheck objects, returned for a given input text.
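A minimal sketch of these objects, here as dataclasses (this is the snippet saved as “model.py” in the last section):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Language:
    identifier: str       # e.g. "en", "es", "fr"
    probability: float

@dataclass
class SpellCheck:
    suggestion: str       # recommended sentence
    similarity: float     # average similarity between input and suggestion

@dataclass
class Response:
    language: Optional[Language]      # None if the detection is rejected
    spellcheck: Optional[SpellCheck]  # None if no check was performed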
Then, we define four simple functions:
- detect_language: given an input text and a confidence threshold, it returns the identified language and its probability.
- map_language_to_dict: maps the identified language to the corresponding Hunspell dictionary.
- spellcheck: given the input text and a Hunspell dictionary, it identifies misspelled words and provides an alternative suggestion, together with a measure of similarity.
- process_input_text: wrapper around the previous functions. Takes an input text and a minimum probability (default: 0.9) under which the identified language is rejected. It returns a Response object.
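A sketch of these functions, under the assumptions discussed above (whitespace tokenization, first suggestion only; the dictionary names are the standard Hunspell codes):

from difflib import SequenceMatcher

import enchant
from langdetect import detect_langs

from model import Language, Response, SpellCheck

# map langdetect identifiers to the installed Hunspell dictionaries
LANG_TO_DICT = {"it": "it_IT", "es": "es_ES", "de": "de_DE", "fr": "fr_FR", "en": "en_GB"}

def detect_language(text, min_prob):
    # detect_langs returns candidates sorted by decreasing probability
    best = detect_langs(text)[0]
    if best.prob < min_prob or best.lang not in LANG_TO_DICT:
        return None
    return Language(identifier=best.lang, probability=best.prob)

def map_language_to_dict(language):
    return enchant.Dict(LANG_TO_DICT[language.identifier])

def spellcheck(text, dictionary):
    # note: a real implementation should strip punctuation before checking
    corrected, scores = [], []
    for token in text.split():
        if dictionary.check(token):
            corrected.append(token)
            scores.append(1.0)
        else:
            suggestions = dictionary.suggest(token)
            best = suggestions[0] if suggestions else token  # first suggestion only
            corrected.append(best)
            scores.append(SequenceMatcher(None, token, best).ratio())
    similarity = sum(scores) / len(scores) if scores else 1.0
    return SpellCheck(suggestion=" ".join(corrected), similarity=similarity)

def process_input_text(text, min_prob=0.9):
    language = detect_language(text, min_prob)
    if language is None:
        return Response(language=None, spellcheck=None)
    return Response(language=language, spellcheck=spellcheck(text, map_language_to_dict(language)))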
Let’s see how it works
We can qualitatively assess the behaviour of the process by analyzing some sentences from different languages, with different spelling mistakes:
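For instance (illustrative calls; the exact suggestions depend on the installed dictionary versions):

process_input_text("Questo è un picolo esempio")    # expected: it, "picolo" -> "piccolo"
process_input_text("Ceci est un petit exmple")      # expected: fr, "exmple" -> "exemple"
process_input_text("Das ist ein kleines Beispeil")  # expected: de, "Beispeil" -> "Beispiel"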
Although these few, random examples cannot be considered a proper test, the outputs we obtain are correct for all languages.
Points of attention
1. Language detection on short/equivocal sentences
The langdetect algorithm is non-deterministic: the user may obtain different results when applying it to the same sentence, especially if it is short or ambiguous.
Consistency may be achieved with a seed:
from langdetect import DetectorFactory
DetectorFactory.seed = 0
We may also use the probability threshold to reject a language detection performed on ambiguous or short text:
Moreover, fixed pre-processing rules may be implemented to manage short text, for example by raising the probability threshold when the input is made of fewer tokens than a preconfigured amount.
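For example (illustrative; very short inputs are precisely where detection is least reliable):

process_input_text("Ok", min_prob=0.9)
# the top candidate's probability will often fall below the threshold,
# so the Response carries no detected language and no suggestion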
2. The Cupertino effect
The spell-checker may replace correct words if they are not available in its dictionaries/knowledge base:
This is known as the Cupertino effect.
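For example, a correctly spelled proper noun that is missing from the dictionary gets flagged and “corrected” (illustrative; the actual suggestions depend on the dictionary version):

import enchant

d = enchant.Dict("en_GB")
d.check("PyEnchant")    # False: the word is correct, but unknown to the dictionary
d.suggest("PyEnchant")  # returns unrelated "corrections"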
One possible way to mitigate this scenario is to extend the dictionaries with custom knowledge bases.
In PyEnchant, this may be achieved through the DictWithPWL class (PWL stands for “Personal Word List”). For example, assuming we have a personal word list in the file “custom_list.txt” to extend the “en_GB” dictionary:
d = enchant.DictWithPWL("en_GB", "custom_list.txt")
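The personal word list is a plain-text file with one word per line; any word it contains is then accepted as correct:

d.check("PyEnchant")   # True, assuming "PyEnchant" is listed in custom_list.txt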
3. Similar words may not be enough
In addition to the threshold on the probability for the detected language, a second threshold on the similarity between the recommended and the original input may avoid inappropriate suggestions:
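A minimal sketch of such a filter (MIN_SIMILARITY is a hypothetical threshold, not part of the code above):

from suggester import process_input_text

MIN_SIMILARITY = 0.8  # hypothetical value, to be tuned for the use case

response = process_input_text("some input sentence")
if response.spellcheck is not None and response.spellcheck.similarity < MIN_SIMILARITY:
    response.spellcheck = None  # the recommendation drifted too far from the input: discard it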
But this may be insufficient: in this simple example, we generate suggestions over single words, without taking into account the overall context or any business purpose.
For example, the word “policey” may be corrected as “police” or “policy”, but the context or business purpose may lead to the exclusion of one of the two alternatives.
Moreover, for each term we have considered only the first suggestion, but the entire suggestion list may be taken into account, and blended with context awareness and/or business rules, in order to produce the most appropriate suggestions for the use case.
Providing a REST API
Finally, we make the service available as a web service using the Flask framework:
- Create a project folder.
- Save the first code snippet (with the objects definition) as “model.py” inside the folder.
- Save the second code snippet (with the functions) as “suggester.py” inside the folder, and import the objects from the model file.
- Create the file “app.py” as follows:
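A minimal version of “app.py”, consistent with the endpoint described below, might look like this:

from dataclasses import asdict

from flask import Flask, jsonify, request

from suggester import process_input_text

app = Flask(__name__)

@app.route("/analyze-sentence", methods=["POST"])
def analyze_sentence():
    body = request.get_json()
    response = process_input_text(body["text"], body.get("min_prob", 0.9))
    return jsonify(asdict(response))

if __name__ == "__main__":
    app.run()  # serves on port 5000 by default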
Basically, we expose the /analyze-sentence endpoint with a POST method, with text and min_prob expected inside the request body.
From the folder, executing the command “python app.py” launches the application and makes the service available (the default port is 5000). Here is an example of an API call, which can be issued from Postman or any HTTP client:
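For instance, the same request with curl:

curl -X POST http://localhost:5000/analyze-sentence \
  -H "Content-Type: application/json" \
  -d '{"text": "Questo è un picolo esempio", "min_prob": 0.9}'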