Fuzzy Scoring Regex Mayhem

Autocompletion is never an entirely solved problem. Can anyone really say what on earth a user typing "uni" into a country input field actually intends to select? It could match any of these:

Tanzania, [U][n][i]ted Republic of
[U][n][i]ted Arab Emirates
[U][n][i]ted Kingdom
[U][n][i]ted States
T[u][n][i]sia

Of course, it’s probably not the last one, but that right there is a human intuition that we often forget to instil into these UI interactions.

We can divine what the user probably intends most of the time but it'll always be a game of heuristics. Most solutions shy away from this game, opting instead to match the query letter-for-letter in each potential value, and this is usually sufficient, but without any other logic not only will “la” match “Latvia” but also “Angola”. And usually “Ltvia” will match nothing whatsoever, even though it’s seemingly obvious what the user is trying to type.

If you try implementing a fuzzy matcher to solve this, the first revelation is that you can't just boolean-match the query against the data like so many solutions do. You need to score each potential match. Hopefully, in the case of country selection, you end up with a sensible subset of countries that match the query to some reasonable degree. This scoring is necessary so that you know what you're putting at the top of the list. When typing "U", the user expects Ukraine or Uzbekistan sooner than Mauritius or Sudan, for example.

Oddly, if you looked at the most common autocompletion widget out there (jQuery UI), it doesn't appear to follow this intuition.

Even the most graceful solutions tend to avoid the muddiness of dealing with mistakes like “untied states” or “leichtenstein”. Sure, the likeliness of a person having to type the name of a country they aren’t intimately familiar with is probably quite low, but people still make mistakes.

I've been intrigued by this topic for quite a while and it's why I originally made relevancy.js. It solves the problem quite well, I think, and it does so in a pretty transparent way, with scores applied for various qualities such as the index of the query within the target string ("king" scores higher than "dom" in "kingdom", for example), but it's still a quite a lot of code for such a tiny part of an overall user experience.

I have once again been playing with this problem (thanks to a certain tweet) and have so wanted to come up with something stupefyingly graceful.

It all starts with a scratch in back of your mind — the one that tells you that your time has come. The world requires you to use regular expressions.

Warning: I don’t sincerely recommend doing any of this. It’s just a bit of fun. It’s probably an inefficient, unreliable, obscure and ultimately foolish endeavour!

Let’s begin!

A static France might look like this:

/^France$/

A more lenient France might be less sensitive to its case:

/^france$/i

We could then allow the characters to be optional too:

/^f?r?a?n?c?e?$/i

This would match “f” and “franc” and “FaE”, etc.

But… users make even more grievous mistakes sometimes, and our regular expression should be able to handle those. So let’s add a single character of leniency between each legitimate character, and at the beginning and end of the string:

/^.?f?.?r?.?a?.?n?.?c?.?e?.?$/i

But then this would allow contiguous mistakes like “fafafafa”. We only want to allow a *single* incorrect mistake after each successfully entered character. For this we can use groups to force each character to be matched and a lazy quantifier on the mistake character to ensure that legitimate characters get to successfully match.

So:

/f.?otherStuff/

Becomes:

/(?:f.??)?otherStuff/

In English: Try to match f followed by otherStuff. If impossible then try to match any character after f but before otherStuff. (This is why lazy quantifiers (e.g. ??) are so useful!)

The entire regex would become:

/^(?:.(?=f))?(?:f.??)?(?:r.??)?(?:a.??)?(?:n.??)?(?:c.??)?(?:e.??)?$/i

We should probably capture each individual match (f should be (f)) so that we can analyze the result and score it appropriately.

var r = /^(?:(f).??)?(?:(r).??)?(?:(a).??)?(?:(n).??)?(?:(c).??)?(?:(e).??)?$/i
 
'f$R-aN_cEx'.match(r);
// => ["f$R-aN_cEx", "f", "R", "a", "N", "c", "E"]

The regular expression, broken down:

/
  ^       # Start of string
 
  (?:     # Non-captured group
    (f)   # Match and capture 'f'
    .??   # Followed lazily by any character
  )?      # Entire group is optional
 
  (?:     # Non-captured group
    (r)   # Match and capture 'f'
    .??   # Followed lazily by any character
  )?      # Entire group is optional
 
  ...     # Etc.
 
  $       # End of string
/i

A quick note: lazy or lazily in the context of regular expressions simply means that that thing will be intentionally excluded from the first match attempt and will only be used if the subsequent regular expression is unsuccessful without it.

One caveat with the above regex is that it doesn’t allow a mistake to be at the beginning of the string. We could fix this with a lookahead to the effect of “allow a mistake here as long as its followed by a non-mistake” but since “non-mistake” could effectively be any character in the legitimate string it’s easier to just make allowances for that initial mistake in each group. Additionally, we probably want to capture every single mistake, in addition to legitimate characters. Here’s our next iteration:

/
  ^         # Start of string
 
  (?:       # Non-captured group
 
    (^.)?   # Captured optional mistake at the beginning of the string
            # ===============================================
 
    (f)     # Match and capture 'f'
    (.??)   # Followed lazily by any character (captured)
  )?        # Entire group is optional
 
  ...     # Etc.
 
  $       # End of string
/i

The check (^.)? has to be specified in each group, to account for mistakes that don’t involve “f”, like “krance” or “ttance”, etc.

Since we’re aiming to genericize this entire mess, we should create a generator that assembles the regular expression given any piece of text:

function makeFuzzyRegex(string) {
 
  if (!string) { return /^$/; }
 
  // Escape any potential special characters:
  var cleansed = string.replace(/\W/g, '\\$&');
 
  return RegExp(
    '^' +
      cleansed.replace(
        // Find every escaped and non-escaped char:
        /(\\?.)/g,
        // Replace with fuzzy character matcher:
        '(?:(^.)?($1)(.??))?'
      ) +
    '$',
    'i'
  );
}
 
makeFuzzyRegex('omg');
// => /^(?:(^.)?(o)(.??))?(?:(^.)?(m)(.??))?(?:(^.)?(g)(.??))?$/i

This regex matched against ‘_o-m*g!’ produces:

[
  // Full match:
  "_o-m*g!",
 
  // Captures:
  "_",           // Mistake
  "o",           // Legit
  "-",           // Mistake
 
  undefined,     // Void mistake
  "m",           // Legit
  "*",           // Mistake
 
  undefined,     // Void mistake
  "g",           // Legit
  "!"            // Mistake
]

The captures are in groups of three, with every second capture being the legitimate character (case-insensitive), and with every first and third potentially being mistakes.

We can then loop through these captures and apply weights as we see fit.

var fullMatch = makeFuzzyRegex('omg').exec('_o-m*g!');
var captures = fullMatch.slice(1); // Get captures specifically
var score = 0;
 
for (var i = 0, l = captures.length; i < l; i += 3) {
  if (captures[i]) score -= 1;
  if (captures[i+1]) score += 10;
  if (captures[i+2]) score -= 1;
}
 
score; // => 26

That scoring is quite arbitrary, but we’ve at least prescribed our wish to score successes more than we punish mistakes (10 vs 1).

We can start to play with the heuristics of this if we wrap it all up:

function createFuzzyScorer(text) {
 
  var matcher = makeFuzzyRegex(text);
 
  return function(query) {
    var match = matcher.exec(query);
 
    if (!match) return 0;
 
    var captures = match.slice(1);
    var score = 0;
 
    for (var i = 0, l = captures.length; i < l; i += 3) {
      if (captures[i]) score -= 1;
      if (captures[i+1]) score += 10;
      if (captures[i+2]) score -= 1;
    }
 
    return score;
  };
 
  function makeFuzzyRegex(string) {
 
    if (!string) { return /^$/; }
 
    // Escape any potential special characters:
    var cleansed = string.replace(/\W/g, '\\$&');
 
    return RegExp(
      '^' +
        cleansed.replace(
          // Find every escaped and non-escaped char:
          /(\\?.)/g,
          // Replace with fuzzy character matcher:
          '(?:(^.)?($1)(.??))?'
        ) +
      '$',
      'i'
    );
  }
}

Our first attempt isn’t too bad:

var score = createFuzzyScorer('omg');
 
score('omg');     // => 30
score('xOmg');    // => 29
score('.o.m.g.'); // => 26
score('om');      // => 20
score('og');      // => 20
score('o');       // => 10
score('nope');    // => 0

These seem like sensible enough scores, generally, but we’re more interested in autocompletion, and so there’s an obvious predictive element there. If a user types ‘o’ then that should probably score higher than ‘g’ if we’re testing against ‘omg’, but with the above mechanism they both receive a standard 10:

var score = createFuzzyScorer('omg');
 
score('o'); // => 10
score('g'); // => 10

We can fix this by applying a higher weight to matches that appear earlier in the string:

// The scoring loop:
for (var i = 0, l = captures.length; i < l; i += 3) {
  if (captures[i]) score -= 0.1;
  if (captures[i+1]) score += (l - i) / l; // the magic
  if (captures[i+2]) score -= 0.1;
}

Now the score given for any singular legitimate match will decrease as the index (i) increases. Here are the results:

var score = createFuzzyScorer('omg');
 
score('omg');     // => 1.99
score('xOmg');    // => 1.90
score('om');      // => 1.66
score('.o.m.g.'); // => 1.59
score('og');      // => 1.33
score('o');       // => 1.00
score('nope');    // => 0.00

This is getting closer to our intuition. The next step would be to try to create a real autocompletion widget. I’ve done it so I know that we’ll want to make one more change. The problem with our scoring right now is that it’ll award legitimate characters relative to the length of the string. But when comparing scores across multiple subject strings, this approach seems broken.

createFuzzyScorer('RuneScript')('Ru'); // 1.9
createFuzzyScorer('Ruby')('Ru');       // 1.7

These should both score equally, as “Ru” is just as likely to become “Ruby” as it is to become “RuneScript”. To achieve this we should only take into account the index, and make the weight of any scoring decision inversely proportional to that index, in this case via an exponential taper (pow(index, -2)).

// The scoring loop:
for (var i = 0, l = captures.length; i < l; i += 3) {
  var relevancyOfCharacter = Math.pow(i + 1, -2);
  if (captures[i]) score -= relevancyOfCharacter * 0.1;
  if (captures[i+1]) score += relevancyOfCharacter * 1;
  if (captures[i+2]) score -= relevancyOfCharacter * 0.1;
}

(Final version of createFuzzyScorer available as a gist.)

See this demo using programming languages as the dataset. Try intentionally misspelling something (jawascript), or missing out characters (jaascit), or just going a little crazy (jahskt). It works beautifully.

To achieve speedy sorting, a fuzzy scorer is created for every single value before the user types anything:

var data = PROGRAMMING_LANGUAGES.map(function(lang, i) {
  return {
    actualValue: lang,
    score: createFuzzyScorer(lang),
    i: i,
    toString: function() { return lang; }
  };
});

This means we can iterate through data on every relevant input event, and call the score() method with the current query. We can then bundle this into a filter->sort->slice flow to get our list of sensible suggestions:

var sorted = data.filter(function(item) {
 
  // Get rid of any very unlikely matches (and cache the score!)
  return (item._cachedScore = item.score(query)) >= .5;
 
}).sort(function(a, b) {
 
  var as = a._cachedScore;
  var bs = b._cachedScore;
 
  // Sort by score, and if score is equal, then by original index:
  // (We would typically return 0 in that case but some engines don't stable-sort)
  return as > bs ? -1 : as == bs && a.i < b.i ? -1 : 1;
 
}).slice(0, 10); // We only need the top 10...

And.. we’re done. It’s never really finished though: you’ll find endless tweaks that can be made to the scorer to make it more believably resemble human-like intuition.

For those wanting to test the resulting country autocompletion interaction: See the demo.

I guess, despite my initial warning, I wouldn’t actually mind using this in production, as long as there were a decent number of unit tests. I’d probably also assemble the regular expressions on the server and serve them up as literals. It’s also worth mentioning that almost everything in this post has been exploring the fuzzy-matching of very short strings in small datasets. Even in the case of the country demo, to get more applicable results, I broke up long names into the component parts and then scored against each. E.g.

// E.g. Macedonia, the Former Yugoslav Republic of:
var scorers = [
  "Macedonia, the Former Yugoslav Republic of",
  "Macedonia",
  "the",
  "former",
  "yugoslav",
  "republic",
  "of"
].map(createFuzzyScorer);
// Etc.

And this would be terribly inefficient on a larger scale, so with any dataset longer than a list of countries you’re probably best to explore Trie-based approaches to autocompletion.

And with that, I’ll shut-up and wish you merry regex’ing!

Thanks for reading! Please share your thoughts with me on Twitter. Have a great day!

Liad March 8th, 2015 at 12:12 am

Great article! Very interesting approach.

I played with the demo a bit, and I think that adding a “punishment” for text too long or too short can help:

score-=Math.abs(query.length-text.length) * 0.05;

Without it, testing “pqrl” won’t suggest “Perl”, but with this addition it will.

Again – great read.

james March 9th, 2015 at 8:31 am

@Liad, Definitely worth adding, good idea!

Fred March 9th, 2015 at 11:23 am

I do think the most accurate solution comes from my phone ( which is a Sony ) and what it does is that it takes the record from your
texting to autocomplete. Therefor a self learning algorithm should be much more efficient

Gaurang March 17th, 2015 at 12:12 pm

Hello James, A very nice guide to fuzzy matching. I had once seen fuzzy matching somewhere and longed to know how it works. Glad that I know now!

But one thing, I don’t exactly still get what does /(?:f.??)?otherStuff/ do; and how it is different from /(?:f.?)?otherStuff/ (single question mark). Can you please explain the difference between the two? Thanks!

james March 17th, 2015 at 2:23 pm

@Gaurang, glad the article was informative! Note that this is just one way to do it — and will probably only work well for small datasets.

Regarding your question, the ?? is a lazy (or “reluctant” / “non-greedy”) quantifier, while the single question mark in .? is a greedy/possessive quantifier. This seems like a pretty good description of the difference between greedy and lazy: https://msdn.microsoft.com/en-us/library/3206d374%28v=vs.110%29.aspx#Greedy

In the example /(?:f.??)?otherStuff/, the . (which represents any character) will only be matched if it has to be in order for the regex to pass. A simpler illustration of this:

/foo.?[A-Z]/.exec('fooAB')  // ["fooAB"]
/foo.??[A-Z]/.exec('fooAB') // ["fooA"]

In the second one we’re using a lazy quantifier (.??). The regular expression can successfully match without having to use .??, and so, it simply skips it. In the first one the .? is greedy and so is matched.

Hope that helps.

Gaurang March 19th, 2015 at 1:54 am

I get it! Thanks!

Mark March 19th, 2015 at 4:52 pm

You have a problem. You try to solve the problem with regexp and now you have two problems.

I think you guys should give more attention to algorithms from the past that solve these kinds of problems already.
I.e. en.m.wikipedia.org/wiki/Levenshtein_distance

Some cool Google results:
https://github.com/gf3/Levenshtein
phpjs.org/functions/similar_text

james March 19th, 2015 at 4:56 pm

Totally agree, Mark.

Keep in mind this post was only intended as a bit of fun, as mentioned…

Hassan M Bashir March 22nd, 2015 at 3:39 pm

Great post buddy

Fuzzy Scoring Regex Mayhem

Let’s begin!

So far there's been 9 Responses to “Fuzzy Scoring Regex Mayhem”

So far there's been 9 Responses to
“Fuzzy Scoring Regex Mayhem”