Typo tolerance

Search with typo tolerance

Everyone makes mistakes. That's why undo/redo is a must-have in any interface where you create or edit something. That's why we add extra padding around clickable elements on touch screens, so taps are easier to register. That's why Google tries to show some results even when what you typed is far from perfect.

Users absolutely love that. They can't imagine software without Ctrl+Z, and they don't expect to stare at a "No results" page just because they mistyped something. It seems the bar is high... and yet, when it comes to searching and showing results, a lot of software does only what is convenient for developers.

Examining the problem

Below is a simple search that works like a filter on the list. The list is short, so it's easy to see what's happening. In other words, all the items are already on the screen; the search just helps us find the one we're looking for.

Please look at the list and type something you see there, misspell something or type something completely different:

  • Apple
  • Banana
  • Blueberry
  • Cherries
  • Lemon
  • Pear
  • Pineapple
  • Tomato

What we've just used here is a simple "contains" query. Or, if you are familiar with SQL, we perform a %LIKE% here. Is it bad? Well, it's okay. Better than a strict comparison, for sure. But it's not super friendly, because you have to type the text exactly right.

The heart of this method is shown in the code below. We filter the list by checking whether any fruit name contains the searched text. There is a bit of user-friendliness here - the search is case-insensitive, which is the desired behavior in most text searches done by users:

const FRUITS = ['Apple', 'Banana', 'Blueberry', 'Cherries' /* etc... */];

function searchHandler(event) {
  const searchText = event.target.value.toLowerCase();

  const filteredFruits = FRUITS.filter((fruit) => {
    return fruit.toLowerCase().includes(searchText);
  });

  // render the list of `filteredFruits`
}
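
For completeness, here is one way the handler above could be wired up. The article doesn't show the markup, so the input element and its id are assumptions:

// Hypothetical hookup: an <input id="search"> firing the `input` event.
document
  .querySelector('#search') // assumed element id
  .addEventListener('input', searchHandler);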

Introducing a tolerance

What about tolerating small mistakes, a.k.a. typos? Let's try again. Search for fruits on the list, but misspell them this time. Maybe aple instead of apple?

  • Apple
  • Banana
  • Blueberry
  • Cherries
  • Lemon
  • Pear
  • Pineapple
  • Tomato

Aple... I mean Apple is still on the list, right? Try bananna, blubery, cheries, peer, and so on. I must admit, the algorithm is not very friendly to search-as-you-type. The experience would be better with a [Search] button, because you wouldn't see false friends pop up while typing. But live filtering is so much better for understanding how it works...

Let's try pee 🤭 for example. You should see Apple and Pear on the list. Both are pretty close matches according to the algorithm we are using.

The algorithm

The algorithm used here is called Levenshtein distance. I'm going to quote Wikipedia on this:

[...] the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

That's both a huge advantage and, sometimes, a problem. The shorter the names of the searchable items, the worse it is for the algorithm. Very short words like Pear are favored while typing, because the number of edits needed to reach a match is relatively small compared to a very long word that needs a lot of insertions.
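
To illustrate (the numbers come from the distance function discussed later, and plum is just an example input I picked; the tolerance of 3 is the one used later in this article):

// A short target is easy to reach within a tolerance of 3:
// levenshtein('plum', 'pear')      === 3  -> accepted, even though it's a different fruit
// levenshtein('plum', 'blueberry') === 7  -> rejected, long words need far more edits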

As stated in the definition, at the very heart of this algorithm we calculate a distance. Then we decide whether that distance is something we accept - in other words, what is the largest number of edits we still tolerate? Let's visualize that and see how far the words are from your searched text:

  • Apple
  • Banana
  • Blueberry
  • Cherries
  • Lemon
  • Pear
  • Pineapple
  • Tomato

Let's go back to our embarrassing pee example 🤭. What you should see on the screen is Apple (3) and Pear (2). How is the distance measured? Please look below:

  _ _ p e e
  a p p l e
  1 2   3

  p e e _
  p e a r
      1 2

In the case of Apple we need to perform 3 operations to get there from "pee": add a and p, and change the first e into l. When it comes to Pear, only 2 operations are needed: change the second e into a and add r at the end. As you can see, it's easier to get Pear from the given input.

So far we have simply kept the order of the items as it was (alphabetical here). But in fact, Pear is closer to what we typed than Apple, so it should land first on the list.

Fear not, we are just going to sort it! Give it a try now:

  • Apple
  • Banana
  • Blueberry
  • Cherries
  • Lemon
  • Pear
  • Pineapple
  • Tomato

Implementation

So how does it work? In a nutshell, we've just changed the searching/filtering algorithm (see the distance check in the code below).

const FRUITS = ['Apple', 'Banana', 'Blueberry', 'Cherries' /* etc... */];
const MIN_DISTANCE = 3; // the largest edit distance we still accept

function searchHandler(event) {
  const searchText = event.target.value.toLowerCase();

  const filteredFruits = FRUITS.filter((fruit) => {
    const distance = levenshtein(fruit.toLowerCase(), searchText);
    return distance <= MIN_DISTANCE;
  });

  // render the list of `filteredFruits`
}

function levenshtein(a, b) {
  // Levenshtein's algorithm calculating the distance between `a` and `b`
}

We compute the distance using Mr. Levenshtein's method, and if it is greater than the distance we accept (MIN_DISTANCE), we filter that entry out.
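
The snippet above only filters. The sorted demo from earlier also orders results by distance; one possible sketch of that part (my reconstruction, not necessarily the exact code behind the demo) keeps each computed distance next to its fruit and sorts on it, inside the same searchHandler:

const scoredFruits = FRUITS
  // pair every fruit with its distance from the searched text
  .map((fruit) => ({
    fruit,
    distance: levenshtein(fruit.toLowerCase(), searchText),
  }))
  // keep only the entries within the accepted tolerance
  .filter(({ distance }) => distance <= MIN_DISTANCE)
  // closest matches first
  .sort((a, b) => a.distance - b.distance)
  .map(({ fruit }) => fruit);

// render the list of `scoredFruits`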

When it comes to the algorithm itself, you may want to implement it on your own, based on the definition from Wikipedia. But if there is one thing I know about computing, it's that there are usually methods far faster than whatever comes to mind first when you look at the mathematical definition.

It's better to just use what is already there on the Internet. Here's the implementation I used.
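
If you'd rather see the idea in code anyway, here is a minimal dynamic-programming sketch (a straightforward textbook version, not the optimized implementation referenced above):

function levenshtein(a, b) {
  // matrix[i][j] = distance between the first i characters of `a`
  // and the first j characters of `b`
  const matrix = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );

  for (let i = 0; i <= a.length; i++) matrix[i][0] = i; // i deletions
  for (let j = 0; j <= b.length; j++) matrix[0][j] = j; // j insertions

  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      matrix[i][j] = Math.min(
        matrix[i - 1][j] + 1, // deletion
        matrix[i][j - 1] + 1, // insertion
        matrix[i - 1][j - 1] + substitutionCost // substitution (or match)
      );
    }
  }

  return matrix[a.length][b.length];
}

// levenshtein('pee', 'apple') === 3
// levenshtein('pee', 'pear')  === 2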

Perfect tolerance (distance)

I couldn't find any formula for this, but my best guess is that the tolerance (distance) you accept should be a bit smaller than the length of the shortest word in your dataset. Otherwise, there is a chance that this shortest word will show up in the results far too often.
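
Expressed in code, that guess could look like this (just one way to write the heuristic down; the -1 offset is my interpretation of "a bit smaller"):

// Rough heuristic: tolerate a bit less than the length of the shortest item.
const shortestLength = Math.min(...FRUITS.map((fruit) => fruit.length));
const MIN_DISTANCE = shortestLength - 1; // 'Pear' has 4 letters, so the tolerance is 3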

Hybrid approach

If you haven't noticed yet, I use a combination of the %LIKE% match and Levenshtein's method, falling back to the latter only when there are no regular matches. That's handy, because an "exact" match is probably what users want. If they already have exactly what they were looking for, they probably don't care about other variants of the searched text that could be considered "fixed" typos.
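
A sketch of that fallback logic (my reconstruction of the idea described above, not necessarily the exact code behind the demos) could look like this:

function search(searchText) {
  const query = searchText.toLowerCase();

  // First, try the plain "contains" (%LIKE%) match...
  const exactMatches = FRUITS.filter((fruit) =>
    fruit.toLowerCase().includes(query)
  );
  if (exactMatches.length > 0) {
    return exactMatches;
  }

  // ...and fall back to typo-tolerant matching only when nothing was found.
  return FRUITS.filter(
    (fruit) => levenshtein(fruit.toLowerCase(), query) <= MIN_DISTANCE
  );
}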

Is that a perfect method?

Well, it's not. But like most solutions, it doesn't have to be perfect. If it adds more value than the confusion it can occasionally bring (because of false friends in the results), then it's useful.

Levenshtein's method is one of many approaches to this problem. If you'd like to see more experiments like this, let me know.

Bonus: Does Google do the same?

Nope. Their "Did you mean?" functionality is very different from this. As far as I know, it's based on us, the users, correcting our own queries when we can't find anything useful because of typos. This way, with the unbelievable amount of data they possess, they can teach the algorithm the best guess for a given "typo". It's far more sophisticated, but it can be super efficient even for long queries.

Anyway, for our front-end needs, and as a first attempt at helping users with typos in search, I think the Levenshtein method is good enough. What do you think?