JavaScript – James Padolsey
https://j11y.io

Fuzzy Scoring Regex Mayhem
https://j11y.io/javascript/fuzzy-scoring-regex-mayhem/
Sat, 07 Mar 2015

Autocompletion is never an entirely solved problem. Can anyone really say what on earth a user typing "uni" into a country input field actually intends to select? It could match any of these:

  • Tanzania, [U][n][i]ted Republic of
  • [U][n][i]ted Arab Emirates
  • [U][n][i]ted Kingdom
  • [U][n][i]ted States
  • T[u][n][i]sia

Of course, it’s probably not the last one, but that right there is a human intuition that we often forget to instil into these UI interactions.

We can divine what the user probably intends most of the time but it'll always be a game of heuristics. Most solutions shy away from this game, opting instead to match the query letter-for-letter in each potential value, and this is usually sufficient, but without any other logic not only will “la” match “Latvia” but also “Angola”. And usually “Ltvia” will match nothing whatsoever, even though it’s seemingly obvious what the user is trying to type.

If you try implementing a fuzzy matcher to solve this, the first revelation is that you can't just boolean-match the query against the data like so many solutions do. You need to score each potential match. Hopefully, in the case of country selection, you end up with a sensible subset of countries that match the query to some reasonable degree. This scoring is necessary so that you know what you're putting at the top of the list. When typing "U", the user expects Ukraine or Uzbekistan sooner than Mauritius or Sudan, for example.

Oddly, if you looked at the most common autocompletion widget out there (jQuery UI), it doesn't appear to follow this intuition.

Even the most graceful solutions tend to avoid the muddiness of dealing with mistakes like “untied states” or “leichtenstein”. Sure, the likeliness of a person having to type the name of a country they aren’t intimately familiar with is probably quite low, but people still make mistakes.

I've been intrigued by this topic for quite a while and it's why I originally made relevancy.js. It solves the problem quite well, I think, and it does so in a pretty transparent way, with scores applied for various qualities such as the index of the query within the target string ("king" scores higher than "dom" in "kingdom", for example), but it's still quite a lot of code for such a tiny part of an overall user experience.

I have once again been playing with this problem (thanks to a certain tweet) and have so wanted to come up with something stupefyingly graceful.

It all starts with a scratch in the back of your mind — the one that tells you that your time has come. The world requires you to use regular expressions.

Warning: I don’t sincerely recommend doing any of this. It’s just a bit of fun. It’s probably an inefficient, unreliable, obscure and ultimately foolish endeavour!

Let’s begin!

A static France might look like this:

/^France$/

A more lenient France might be less sensitive to its case:

/^france$/i

We could then allow the characters to be optional too:

/^f?r?a?n?c?e?$/i

This would match “f” and “franc” and “FaE”, etc.
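A quick sanity check of that pattern (variable name assumed):

var lenient = /^f?r?a?n?c?e?$/i;

lenient.test('f');     // => true
lenient.test('franc'); // => true
lenient.test('FaE');   // => true
lenient.test('xyz');   // => false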

But… users make even more grievous mistakes sometimes, and our regular expression should be able to handle those. So let’s add a single character of leniency between each legitimate character, and at the beginning and end of the string:

/^.?f?.?r?.?a?.?n?.?c?.?e?.?$/i

But then this would allow contiguous mistakes like “fafafafa”. We only want to allow a *single* mistake after each successfully entered character. For this we can use groups to force each character to be matched, and a lazy quantifier on the mistake character, to ensure that legitimate characters still get to match.

So:

/f.?otherStuff/

Becomes:

/(?:f.??)?otherStuff/

In English: Try to match f followed by otherStuff. If impossible then try to match any character after f but before otherStuff. (This is why lazy quantifiers (e.g. ??) are so useful!)
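To see that behaviour in isolation, here's a throwaway sketch with a two-character target:

var r = /^(?:(f).??)?(r)/;

r.exec('fr');  // => ["fr", "f", "r"]   (.?? lazily matches nothing)
r.exec('fxr'); // => ["fxr", "f", "r"]  (.?? absorbs the 'x' mistake)
r.exec('ffr'); // => ["ffr", "f", "r"]  (.?? absorbs the surplus 'f')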

The entire regex would become:

/^(?:f.??)?(?:r.??)?(?:a.??)?(?:n.??)?(?:c.??)?(?:e.??)?$/i

We should probably capture each individual match (f should be (f)) so that we can analyze the result and score it appropriately.

var r = /^(?:(f).??)?(?:(r).??)?(?:(a).??)?(?:(n).??)?(?:(c).??)?(?:(e).??)?$/i
 
'f$R-aN_cEx'.match(r);
// => ["f$R-aN_cEx", "f", "R", "a", "N", "c", "E"]

The regular expression, broken down:

/
  ^       # Start of string
 
  (?:     # Non-captured group
    (f)   # Match and capture 'f'
    .??   # Followed lazily by any character
  )?      # Entire group is optional
 
  (?:     # Non-captured group
    (r)   # Match and capture 'r'
    .??   # Followed lazily by any character
  )?      # Entire group is optional
 
  ...     # Etc.
 
  $       # End of string
/i

A quick note: lazy or lazily in the context of regular expressions simply means that the token in question is intentionally excluded from the first match attempt, and will only be consumed if the rest of the regular expression is unsuccessful without it.

One caveat with the above regex is that it doesn’t allow a mistake to be at the beginning of the string. We could fix this with a lookahead to the effect of “allow a mistake here as long as it’s followed by a non-mistake” but since “non-mistake” could effectively be any character in the legitimate string it’s easier to just make allowances for that initial mistake in each group. Additionally, we probably want to capture every single mistake, in addition to legitimate characters. Here’s our next iteration:

/
  ^         # Start of string
 
  (?:       # Non-captured group
 
    (^.)?   # Captured optional mistake at the beginning of the string
            # ===============================================
 
    (f)     # Match and capture 'f'
    (.??)   # Followed lazily by any character (captured)
  )?        # Entire group is optional
 
  ...     # Etc.
 
  $       # End of string
/i

The check (^.)? has to be specified in each group, to account for mistakes that don’t involve “f”, like “krance” or “ttance”, etc.
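We can check that per-group (^.)? behaviour against the hand-written ‘france’ version (before we generalise it):

var r = /^(?:(^.)?(f)(.??))?(?:(^.)?(r)(.??))?(?:(^.)?(a)(.??))?(?:(^.)?(n)(.??))?(?:(^.)?(c)(.??))?(?:(^.)?(e)(.??))?$/i;

r.test('france');  // => true
r.test('krance');  // => true ('k' is captured by the (^.)? of the 'r' group)
r.test('xkrance'); // => false (only one initial mistake is allowed)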

Since we’re aiming to genericize this entire mess, we should create a generator that assembles the regular expression given any piece of text:

function makeFuzzyRegex(string) {
 
  if (!string) { return /^$/; }
 
  // Escape any potential special characters:
  var cleansed = string.replace(/\W/g, '\\$&');
 
  return RegExp(
    '^' +
      cleansed.replace(
        // Find every escaped and non-escaped char:
        /(\\?.)/g,
        // Replace with fuzzy character matcher:
        '(?:(^.)?($1)(.??))?'
      ) +
    '$',
    'i'
  );
}
 
makeFuzzyRegex('omg');
// => /^(?:(^.)?(o)(.??))?(?:(^.)?(m)(.??))?(?:(^.)?(g)(.??))?$/i

This regex matched against ‘_o-m*g!’ produces:

[
  // Full match:
  "_o-m*g!",
 
  // Captures:
  "_",           // Mistake
  "o",           // Legit
  "-",           // Mistake
 
  undefined,     // Void mistake
  "m",           // Legit
  "*",           // Mistake
 
  undefined,     // Void mistake
  "g",           // Legit
  "!"            // Mistake
]

The captures are in groups of three, with every second capture being the legitimate character (case-insensitive), and with every first and third potentially being mistakes.

We can then loop through these captures and apply weights as we see fit.

var fullMatch = makeFuzzyRegex('omg').exec('_o-m*g!');
var captures = fullMatch.slice(1); // Get captures specifically
var score = 0;
 
for (var i = 0, l = captures.length; i < l; i += 3) {
  if (captures[i]) score -= 1;
  if (captures[i+1]) score += 10;
  if (captures[i+2]) score -= 1;
}
 
score; // => 26

That scoring is quite arbitrary, but we’ve at least prescribed our wish to score successes more than we punish mistakes (10 vs 1).

We can start to play with the heuristics of this if we wrap it all up:

function createFuzzyScorer(text) {
 
  var matcher = makeFuzzyRegex(text);
 
  return function(query) {
    var match = matcher.exec(query);
 
    if (!match) return 0;
 
    var captures = match.slice(1);
    var score = 0;
 
    for (var i = 0, l = captures.length; i < l; i += 3) {
      if (captures[i]) score -= 1;
      if (captures[i+1]) score += 10;
      if (captures[i+2]) score -= 1;
    }
 
    return score;
  };
 
  function makeFuzzyRegex(string) {
 
    if (!string) { return /^$/; }
 
    // Escape any potential special characters:
    var cleansed = string.replace(/\W/g, '\\$&');
 
    return RegExp(
      '^' +
        cleansed.replace(
          // Find every escaped and non-escaped char:
          /(\\?.)/g,
          // Replace with fuzzy character matcher:
          '(?:(^.)?($1)(.??))?'
        ) +
      '$',
      'i'
    );
  }
}

Our first attempt isn’t too bad:

var score = createFuzzyScorer('omg');
 
score('omg');     // => 30
score('xOmg');    // => 29
score('.o.m.g.'); // => 26
score('om');      // => 20
score('og');      // => 20
score('o');       // => 10
score('nope');    // => 0

These seem like sensible enough scores, generally, but we’re more interested in autocompletion, and so there’s an obvious predictive element there. If a user types ‘o’ then that should probably score higher than ‘g’ if we’re testing against ‘omg’, but with the above mechanism they both receive a standard 10:

var score = createFuzzyScorer('omg');
 
score('o'); // => 10
score('g'); // => 10

We can fix this by applying a higher weight to matches that appear earlier in the string:

// The scoring loop:
for (var i = 0, l = captures.length; i < l; i += 3) {
  if (captures[i]) score -= 0.1;
  if (captures[i+1]) score += (l - i) / l; // the magic
  if (captures[i+2]) score -= 0.1;
}

Now the score given for any singular legitimate match will decrease as the index (i) increases. Here are the results:

var score = createFuzzyScorer('omg');
 
score('omg');     // => 1.99
score('xOmg');    // => 1.90
score('om');      // => 1.66
score('.o.m.g.'); // => 1.59
score('og');      // => 1.33
score('o');       // => 1.00
score('nope');    // => 0.00

This is getting closer to our intuition. The next step would be to try to create a real autocompletion widget. I’ve done it so I know that we’ll want to make one more change. The problem with our scoring right now is that it’ll award legitimate characters relative to the length of the string. But when comparing scores across multiple subject strings, this approach seems broken.

createFuzzyScorer('RuneScript')('Ru'); // 1.9
createFuzzyScorer('Ruby')('Ru');       // 1.7

These should both score equally, as “Ru” is just as likely to become “Ruby” as it is to become “RuneScript”. To achieve this we should only take into account the index, and make the weight of any scoring decision inversely proportional to that index, in this case via an inverse-square taper (pow(index, -2)).

// The scoring loop:
for (var i = 0, l = captures.length; i < l; i += 3) {
  var relevancyOfCharacter = Math.pow(i + 1, -2);
  if (captures[i]) score -= relevancyOfCharacter * 0.1;
  if (captures[i+1]) score += relevancyOfCharacter * 1;
  if (captures[i+2]) score -= relevancyOfCharacter * 0.1;
}
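With that taper in place (and assuming the same createFuzzyScorer wrapper from before), prefix queries now score identically regardless of the target’s length:

createFuzzyScorer('RuneScript')('Ru'); // => 1.0625
createFuzzyScorer('Ruby')('Ru');       // => 1.0625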

(Final version of createFuzzyScorer available as a gist.)

See this demo using programming languages as the dataset. Try intentionally misspelling something (jawascript), or missing out characters (jaascit), or just going a little crazy (jahskt). It works beautifully.

To achieve speedy sorting, a fuzzy scorer is created for every single value before the user types anything:

var data = PROGRAMMING_LANGUAGES.map(function(lang, i) {
  return {
    actualValue: lang,
    score: createFuzzyScorer(lang),
    i: i,
    toString: function() { return lang; }
  };
});

This means we can iterate through data on every relevant input event, and call the score() method with the current query. We can then bundle this into a filter->sort->slice flow to get our list of sensible suggestions:

var sorted = data.filter(function(item) {
 
  // Get rid of any very unlikely matches (and cache the score!)
  return (item._cachedScore = item.score(query)) >= .5;
 
}).sort(function(a, b) {
 
  var as = a._cachedScore;
  var bs = b._cachedScore;
 
  // Sort by score, and if score is equal, then by original index:
  // (We would typically return 0 in that case but some engines don't stable-sort)
  return as > bs ? -1 : as == bs && a.i < b.i ? -1 : 1;
 
}).slice(0, 10); // We only need the top 10...

And.. we’re done. It’s never really finished though: you’ll find endless tweaks that can be made to the scorer to make it more believably resemble human-like intuition.

For those wanting to test the resulting country autocompletion interaction: See the demo.

I guess, despite my initial warning, I wouldn’t actually mind using this in production, as long as there were a decent number of unit tests. I’d probably also assemble the regular expressions on the server and serve them up as literals. It’s also worth mentioning that almost everything in this post has been exploring the fuzzy-matching of very short strings in small datasets. Even in the case of the country demo, to get more applicable results, I broke up long names into the component parts and then scored against each. E.g.

// E.g. Macedonia, the Former Yugoslav Republic of:
var scorers = [
  "Macedonia, the Former Yugoslav Republic of",
  "Macedonia",
  "the",
  "former",
  "yugoslav",
  "republic",
  "of"
].map(createFuzzyScorer);
// Etc.

And this would be terribly inefficient on a larger scale, so with any dataset longer than a list of countries you’re probably best to explore Trie-based approaches to autocompletion.
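For the curious, the Trie idea in its most basic form looks something like this (a sketch of my own, not from any library): you pay the cost of building the tree once, and prefix retrieval then costs only the length of the query (though you lose the fuzziness).

function Trie() { this.root = {}; }

Trie.prototype.insert = function(word) {
  var node = this.root;
  for (var i = 0; i < word.length; i++) {
    var ch = word.charAt(i).toLowerCase();
    node = node[ch] || (node[ch] = {});
  }
  node.$word = word; // mark the end of a complete word
};

Trie.prototype.withPrefix = function(prefix) {
  var node = this.root;
  var results = [];
  for (var i = 0; i < prefix.length; i++) {
    node = node[prefix.charAt(i).toLowerCase()];
    if (!node) return results;
  }
  (function collect(n) {
    if (n.$word) results.push(n.$word);
    for (var key in n) {
      if (key !== '$word') collect(n[key]);
    }
  }(node));
  return results;
};

var trie = new Trie();
['Ruby', 'RuneScript', 'Rust'].forEach(function(w) { trie.insert(w); });
trie.withPrefix('ru'); // => ["Ruby", "RuneScript", "Rust"] (order not guaranteed)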

And with that, I’ll shut up and wish you merry regex’ing!

What is React?
https://j11y.io/javascript/what-is-react/
Tue, 31 Dec 2013

In the constant flurry of JavaScript MVC and MVVM drama I missed the release of React earlier this year, or rather I ignored the release, most probably dismissing it as yet another spin on expressive two-way binding. After reading "The Future of JavaScript MVC frameworks" I decided to have a better look and I’m very glad I did… React is awesome!

So what is React? It’s a JavaScript library for building UIs.

On the face of it, it looks a tad… fanciful:

/** @jsx React.DOM */
var HelloMessage = React.createClass({
  render: function() {
    return <div>{'Hello ' + this.props.name}</div>;
  }
});
 
React.renderComponent(<HelloMessage name="John" />, mountNode);

But React is more than it seems. In fact it's quite foundational in what it offers. It’s trying to solve the problem of the DOM in a refreshingly novel way: by completely mocking it and only touching the real thing when it needs to.

React Components provide a render method, which returns a virtual DOM structure. Upon state changes, this virtual DOM is reconciled against the real DOM — and only the minimal set of DOM manipulations will occur in order to actualise your changes. How wonderful! It's so clean.

It’s perfectly logical, really. We know that DOM mutations are slow. Having a virtual DOM, which you can easily diff against newer versions of itself, and which can then be reflected in the actual DOM seems to make utter sense. This is what we should have started doing five years ago.
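To make that concrete, here’s a sketch using the same period APIs shown in this post (Counter and mountNode are assumed names):

/** @jsx React.DOM */
var Counter = React.createClass({
  getInitialState: function() { return { n: 0 }; },
  tick: function() { this.setState({ n: this.state.n + 1 }); },
  render: function() {
    // On each setState, React diffs this virtual structure against the
    // previous one, and only the changed text node touches the real DOM.
    return React.DOM.button({ onClick: this.tick }, 'Clicked ' + this.state.n);
  }
});

React.renderComponent(Counter(), mountNode);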

What I like most about React is that it doesn't impose heady design patterns and data-modelling abstractions on me. I am someone that'll probably never buy into someone else's take on the "ideal" framework. I'd rather piece together my own precarious stack of cards. And React allows me this small joy. Its opinions are so minimal and its abstractions so focused on the problem of the DOM, that you can merrily slap your design choices atop.

But what about stuff like this? –

//...
  render: function() {
    return <div>{'Hello ' + this.props.name}</div>;
  }

Those literal XML chunks are pieces of JSX. If this scares you, React does provide straight-up APIs for DOM generation:

//...
  render: function() {
    return React.DOM.div({}, 'Hello ' + this.props.name);
  }

You might think this all looks like a step backwards in expressiveness and cleanliness. But it’s really a big step forwards, for it means we no longer need to bother with data-* annotations and physical DOM attributes bidirectionally tied to pieces of data.

It’s well worth checking out!

Straight-up interpolation
https://j11y.io/javascript/straight-up-interpolation/
Mon, 12 Aug 2013

For basic interpolation a full-blown templating engine seems a bit overkill. Even the “Embedded JS” variants (like Underscore’s & LoDash’s) are not ideal when all you need is simple value interpolation, i.e.

"some/long/url/{userId}/{actionName}?mode={mode}"

The first thing to understand is that runtime performance is key. Doing a match for {...} at runtime is wasteful. We can do it just once, at the beginning and generate a simpler runtime process. This is how most template engines now work, sacrificing creation-time performance for run-time performance.

Resig’s micro-templating approach and other EJS interpolators tend to generate a Function that builds an expression of many concatenations. LoDash, for example, returns the following generated function from _.template('Oh <%=wow%>'):

// Generated Function:
function (obj) {
  obj || (obj = {});
  var __t, __p='', __e=_.escape;
  with (obj) {
    __p += 'Oh ' + ((__t=(wow)) == null ? '' : __t);
  }
  return __p;
}

The wow is treated as a JS expression which is evaluated at runtime within a with statement — which means it can directly access properties of the passed object. It works very well and, its flexibility considered, performance is very good.

For the specialised case of just interpolating known property names we can optimise further. One can ask: What is the fastest way to write the function doInterpolation so that these cases are fulfilled:

doInterpolation({
    foo: 123,
    bar: 456
}); // => 'Foo is 123 ... Bar is 456'
 
doInterpolation({
    foo: 999,
    bar: 111
}); // => 'Foo is 999 ... Bar is 111'

And one would struggle to come up with a more performant function than this:

function doInterpolation(obj) {
  return 'Foo is ' + obj.foo + ' ... Bar is ' + obj.bar;
}

So, given that, in a generic interpolator-generator we should endeavour to generate a function employing the same direct concatenation approach, and that’s what this monstrosity does:

/**
 * https://gist.github.com/padolsey/6008842
 * Outputs a new function with interpolated object property values.
 * Use like so:
 *   var fn = makeInterpolator('some/url/{param1}/{param2}');
 *   fn({ param1: 123, param2: 456 }); // => 'some/url/123/456'
 */
var makeInterpolator = (function() {
  var rc = {
    '\n': '\\n', '\r': '\\r', '"': '\\"',
    '\u2028': '\\u2028', '\u2029': '\\u2029'
  };
  return function makeInterpolator(str) {
    return new Function(
      'o',
      'return "' + (
        str
        .replace(/["\n\r\u2028\u2029]/g, function($0) {
          return rc[$0];
        })
        .replace(/\{([\s\S]+?)\}/g, '" + o["$1"] + "')
      ) + '";'
    );
  };
}());

The generated function from makeInterpolator('Oh {wow}') would be:

function (o) {
  return "Oh " + o["wow"] + "";
}

Apart from the closing + "" (I’m calling it a remnant of efficiency), that generated function is the most performant technique to interpolate the varying o.wow value into a pre-defined string. It uses square-bracket notation to access the property values instead of the dot notation just in case someone ends up using a reserved word (e.g. `instanceof`) or an invalid JS identifier.
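That choice is easy to demonstrate (hypothetical keys, using the makeInterpolator defined above):

var fn = makeInterpolator('{instanceof} ... {data-id}');

fn({ 'instanceof': 'reserved', 'data-id': 42 }); // => 'reserved ... 42'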

The curly-matching regular expression, /\{([\s\S]+?)\}/g, can be modified to your heart’s content. You might, for example, prefer the double-curly {{foo}} or the dollar, $foo.

There are additional features that you could add, such as HTML escaping or echoing empty strings in cases of undefined values… but I think adding more complexity would warrant a proper test-suite, and with each added feature you generalise further and eventually end up with something that has no notable benefit over a generic templating engine.

Abstracting The Web Worker: Operative
https://j11y.io/javascript/abstracting-the-web-worker-operative/
Thu, 18 Jul 2013

HTML5 Web Workers are the web’s answer to multi-threading. They’ve been around for a while now, so are pretty safe to rely on. A Worker is typically initialised by instantiating Worker with the URL of your ‘worker script’:

var myWorker = new Worker('path/to/my/worker.js');

And then you’d interface with the worker via asynchronous message-passing:

// == Within parent page ==
myWorker.postMessage('Did you receive this?');
myWorker.onmessage = function(e) {
  if (e.data === 'I sure did, friend.') {
    myWorker.postMessage('Cool, I am gonna send you some data, k?');
  }
  // etc.
}
 
// == Within worker ==
self.onmessage = function(e) {
  if (e.data === 'Did you receive this?') {
    self.postMessage('I sure did, friend.');
  }
  // etc.
};

The message-passing code I tend to see in the wild is typically un-abstracted and chaotic. I felt there were some things worth improving in this area, so I created operative.

Operative allows you to define worker code inline without having to create specific files and load them separately at runtime. With operative it’s as simple as:

var calculator = operative({
 
  // Any methods defined here are executed within a worker!
 
  doComplexThing: function(a, b, c) {
    // do stuff.
    return result;
  }
 
});
 
// Call method, passing a callback as the last arg:
calculator.doComplexThing(1, 2, 3, function(result) {
  result; // => value returned from within worker
});

Additional benefits include:

  • Debuggability (Blob + Chromium Web Inspector = Pure Bliss)
  • Console debug methods for free (log, warn, debug etc.)
  • Degradability / Progressive Enhancement (It works with no/partial worker support)

Operative’s Degradability

Operative degrades in this order:

(higher is better/cooler)

  • Full Worker via Blob & Structured-Cloning (Ch13+, FF8+, IE11+, Op11.5+, Sf5.1+)
  • Full Worker via Eval & Structured-Cloning (IE10)
  • Full Worker via Blob & JSON marshalling (???)
  • Full Worker via Eval & JSON marshalling (Sf4)
  • No Worker: Regular JS called inline (older browsers = slow)

Operative will degrade in environments with no Worker support. In such a case the code would execute as regular in-place JavaScript. The calls will still be asynchronous though, not immediate.
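That asynchronous guarantee is easy to picture with the API from above (a sketch, assuming operative is loaded):

var calc = operative({
  add: function(a, b) { return a + b; }
});

calc.add(1, 2, function(sum) {
  console.log('answer: ' + sum); // logs second, even with no Worker support
});
console.log('asked');            // logs first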

Leaky but worth it?

The abstraction that operative offers is somewhat leaky. I see this as an acceptable leakiness. I would hope that anyone using operative understands what’s really happening behind the scenes.

One particular leak is the lack of the native worker APIs, such as importScripts, in the fully-degraded state which runs JavaScript in-place. If you’re just aiming for newer browsers then you can use importScripts to your heart’s content though.

Other leaks are the subtle differences you could encounter between structured-cloning and the degraded JSON-transfer approach. E.g. one understands /regexes/ and the other doesn’t. Operative uses JSON stringify/parse where structured-cloning is unavailable.

Basically, short of creating a new DSL with severely limited possibilities, I very much doubt it’s even possible to create a non-leaky worker abstraction. And I think that’s ok. Operative isn’t perfection. It’s just a tad better and a tad more seamless than the conventional way of utilising workers… And that is sufficient benefit for me to feel okay with using it.

Other options

I wasn’t even going to make operative degradable until Calvin Metcalf mentioned his similar-in-principle communistjs library, which does degrade. Seeing it as a possibility I decided to implement JSON-transfer fallbacks and an eval option so that IE10 (the browser of rubbish worker/blob support) could be used.

I suggest checking out both the communistjs library and operative if you’re looking to abstract away Worker awkwardness.

Also, watch out, operative is new and potentially buggy. It’s also a bit of an experiment. I’m not even sure this kind of thing is worth abstracting.

Cargo-culting in JavaScript
https://j11y.io/javascript/cargo-culting-in-javascript/
Fri, 24 May 2013

I recently wrote an article for Nettuts+, “Cargo-Culting in JavaScript”. In it I cover the concept of cargo-cult programming and specific techniques that seem to be cargo-culted off quite frequently in JavaScript.

I found that writing the article took a great deal longer than I’d anticipated. Writing an opinionated piece is always a little risky and making the prose itself measured and reasonable is hard work.

Rereading it I realise it seems a little preachy. I think that may just be the nature of the principle behind the article though.

What’s quite funny is that regular tutorials and code samples inject the same amount of opinion and subjective preference but it is under the guise of code and so you don’t feel manipulated or provoked. Instead you feel appreciative and subdued. This is in contrast to prose… it only takes a bit and you’ll feel preached-to.

Code, on the other hand, seems factual and robotic. You can hardly debate a piece of code (if you try you’re either a “nitpicker” or “troll”). But shared code is far more pervasive a manipulator than we’d like to think. If it wasn’t then cargo-cult-programming wouldn’t exist. Techniques, both good and bad, spread like wildfire.

Sonic & The State Of Spinners
https://j11y.io/css/sonic-the-state-of-spinners/
Sun, 31 Mar 2013

Over a year ago I released Sonic, a JavaScript Canvas utility for making loading spinners. I’ve used it in a couple of my own projects and was pleased with the result, but I quickly became aware that others may not be happy to:

  • Include a 1.5k JS utility + specific sonic configuration
  • Depend on HTML5 Canvas: lacking browser support & potential performance costs

Not long after releasing Sonic, Github user cadc made SonicGIF, introducing me to a novel concept — converting images in an HTML5 Canvas to an animated GIF on the fly using jsgif.

Fast forward to this week and I’ve been working on a live Sonic editor so you can create spinners using JavaScript, immediately have them converted to GIF, and even a PNG Sprite (example), for you to use via a CSS3 steps Animation.

The spinner you use should be tailored to your needs though. This particular niche of graphic animation is split between various different techniques, each with merits of their own, so choose wisely…

State of spinners 2013

  • GIF
    • Good: simple, widely supported
    • Bad: limited FPS (depends on browser), 256 colours, difficult to change
  • APNG
    • Good: More colours than GIF, 8-bit transparency
    • Bad: Support very limited, difficult to change
  • SVG via <animate>, <animateTransform> etc.
    • Good: Vector based, easily editable
    • Bad: No IE support, continually recalculated [1]
  • SVG via JavaScript
    • Good: Vector based, easily editable, widely supported
    • Bad: Only IE9+, continually recalculated [1]
  • Canvas via JavaScript
    • Good: Widely supported, easily editable, pixel control, can cache rendered frames
    • Bad: Only IE9+, can be slow if frames aren’t cached
  • Sprite animated via JavaScript (i.e. setInterval + mutating backgroundPosition)
    • Good: Widely supported, relatively quick
    • Bad: Little opportunity for browser to optimise, difficult to change
  • DOM+CSS Animated via JavaScript
    • Good: Widely supported, easily editable
    • Bad: Little opportunity for browser to optimise, continually recalculated [1]
  • Sprite animated via CSS3
    • Good: GPU acceleration (depending on browser, device)
    • Bad: Difficult to edit (Sprite), IE9 and below not supported
  • DOM+CSS Animated via CSS3
    • Good: GPU acceleration (depending on browser, device), easily editable
    • Bad: IE9 and below not supported

[1]: By “continually recalculated”, I mean that the animated property values need to be recalculated on each frame and then the corresponding graphic needs to be drawn (even if you cache property values, individual elements need to be drawn for each frame). As far as I know this is unavoidable with traditional JavaScript DOM Animations (including SVG Animations). This is in contrast to Canvas, for example, where you have the opportunity to cache rendered frames and simply re-run them indefinitely.
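The frame-caching idea mentioned in [1] looks roughly like this (a sketch; drawSpinnerFrame is a hypothetical drawing routine):

var FRAME_COUNT = 24;
var canvas = document.querySelector('canvas');
var ctx = canvas.getContext('2d');

// Render every frame once, up front, each to its own off-screen canvas:
var frames = [];
for (var i = 0; i < FRAME_COUNT; i++) {
  var frame = document.createElement('canvas');
  frame.width = canvas.width;
  frame.height = canvas.height;
  drawSpinnerFrame(frame.getContext('2d'), i / FRAME_COUNT); // hypothetical
  frames.push(frame);
}

// Playback is then just a cheap blit, re-run indefinitely:
var n = 0;
setInterval(function() {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  ctx.drawImage(frames[n], 0, 0);
  n = (n + 1) % FRAME_COUNT;
}, 1000 / 25);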

The graphical capabilities of each technique should also be taken in account. If it’s a straightforward circling snake or dots then you’re probably best going with either SVG or DOM+CSS. If, on the other hand, it’s a more complex animation (e.g. gradients, blur, unique pathing) you may want to venture into Canvas territory or even developing your own PNG Sprite in Photoshop (or Sonic Creator!!). That said, a straightforward GIF may be the best solution if you’re not too fussed about 256 colours, lack of alpha transparency or a lower FPS.

Performance

From limited testing the most performant spinners are those using CSS3, either to animate DOM elements, or to animate through a PNG Sprite. Of course, a single spinner on a page will cost very little, so it’s not necessarily something you need to worry about.

Resources

For CSS+DOM Animation where all you want is a basic spinner (i.e. lines arranged in a circular fashion, pulsating with frequency) I suggest using spin.js, which utilises CSS3 Animations and falls back to VML in older versions of IE.

If you do need more control and are happy using either HTML5 Canvas, GIFs or PNG Sprites (with CSS3) then I reckon you should try out Sonic Creator.

Building SIML: A new markup language
https://j11y.io/css/building-siml-a-new-markup-language/
Sun, 17 Mar 2013

A few weeks ago I set about creating a new markup language. I wanted to learn more about language parsing, grammars, and the various difficulties involved.

I also had a very specific idea of what I wanted to create: a dead simple alternative to HTML. I’d recently picked up SASS and tried to draw on its succinctness to inspire me. CSS itself is quite succinct in how it declares elements, IDs, classes and attributes. And SASS, drawing on its own inspiration, HAML, adds the elegance of tabbed nesting.

I’d done something similar a while ago, allowing you to get DOM structures from basic CSS selectors:

ul li:5 span[innerHTML="item"]

Using satisfy() this becomes:

<ul>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
</ul>

But I didn’t want to stop there; I wanted to create a way to define entire HTML documents with minimal syntax, i.e. allowing you to write stuff like:

html
  head
    title 'something'
  body
    h1 a[href=/] 'something'

Creating the parser

I began by looking into PEGjs, a really impressive parser generator for JavaScript. It allows you to specify the rules of your grammar like so:

Single
  = Attribute
  / Element
  / Text
  / Directive
 
//...
 
Attribute
  = name:AttributeName _ ":" _ value:Value (_ ";")? {
    // This bit is just regular JavaScript...
    return ['Attribute', [name, value]];
  }

Above specifies the grammar rule, Single, which defines various valid “Single” definitions, such as Attribute, which is also specified above. The Attribute rule references AttributeName:

AttributeName
  = name:[A-Za-z0-9-_]+ { return name.join(''); }
  / String

An AttributeName can be a string of characters matching the pattern [A-Za-z0-9-_]+ or a String (wrapped in quotes), which is also specified in the grammar.

It’s seemingly dead-simple, although there are gotchas like left-hand-side recursion and poisonously inefficient backtracking. At one point it was taking my parser 700ms to parse this:

a {
  b {
    c {}
  }
}

I found that I was writing rules in such a way that meant there was a lot of backtracking happening. I.e. when the parser tried a rule and failed on it, it would go back to the initial character trying the next alternate rule. In a nutshell, don’t do this:

SomeRule
  = [a-zA-Z]+ '::' [0-9]+ ';'
  / [a-zA-Z]+ '::' [0-9]+

Instead, just make the semi-colon optional:

SomeRule
  = [a-zA-Z]+ '::' [0-9]+ ';'?

This may seem trivial but it’s not always easy to spot in higher-level rules. Small optimisations like this matter.

I was able to get that ridiculous 700ms down to 5ms! And there are still improvements to be made.

Creating the generator

The generator would have to be able to take output from the parser and generate HTML from it. From a string like a b c the parser outputs a structure like this:

[
   "Element",
   [
      [
         [
            "Tag",
            "a"
         ]
      ],
      [
         [
            "Element",
            [
               [
                  [
                     "Tag",
                     "b"
                  ]
               ],
               [
                  [
                     "Element",
                     [
                        [
                           [
                              "Tag",
                              "c"
                           ]
                        ]
                     ]
                  ]
               ]
            ]
         ]
      ]
   ]
]

The HTML generation was quite simple to do. Essentially, I treated every Element as an entity that can have children. An Element’s children could be other Elements, Attributes, Text or even custom directives. So, this:

label {
  label: foo;
  input#foo
}

Would parse to:

[
   "Element",
   [
      [
         [
            "Tag",
            "label"
         ]
      ],
      [
         [
            "IncGroup",
            [
               [
                  "Attribute",
                  [
                     "label",
                     "foo"
                  ]
               ],
               [
                  "Element",
                  [
                     [
                        [
                           "Tag",
                           "input"
                        ],
                        [
                           "Id",
                           "foo"
                        ]
                     ]
                  ]
               ]
            ]
         ]
      ]
   ]
]

Essentially, the hierarchy that you originally write is reflected in the tree outputted by the parser. The generator can then just recurse through this structure, creating HTML strings as it goes along.

For example, this is the default generator for HTML attributes:

//...
    _default: {
      type: 'ATTR',
      make: function(attrName, value) {
        if (value == null) {
          return attrName;
        }
        return attrName + '="' + escapeHTML(value) + '"';
      }
    },
  //...

This would make `for:foo;` output the HTML `for="foo"`.

Fun feature: Exclusives

The fake power you feel when creating a language frequently manifests in strange features and syntax. That’s what happened here. Although I do genuinely feel that this particular one is useful.

I’m talking about “Exclusive Groups”. When writing your CSS-style selectors, it allows you to specify alternates within braces, and these will then be expanded so that the resulting HTML conforms to all the potential combinations. An example:

x (a/b) // expands to: "x a, x b"

That would give you:

<x>
  <a></a>
</x>
<x>
  <b></b>
</x>

A more complex example:

(a/b) (x/y)

That would give you:

<a><x></x></a>
<a><y></y></a>
<b><x></x></b>
<b><y></y></b>

The original selector (a/b)(x/y) expanded to a x, a y, b x, b y.

A little nifty, a little pointless.. perhaps. Although it can be useful:

ul li ('A'/'List'/'Of'/'Stuff')

(becomes)

<ul>
  <li>A</li>
  <li>List</li>
  <li>Of</li>
  <li>Stuff</li>
</ul>

Indentation

I wanted there to be the option to use traditional CSS curlies to demarcate nestings. I.e.

div {
  ul {
    li {
      //...
    }
  }
}

But I also wanted auto-nesting via indentation, like in SASS:

div
  ul
    li
      //...

Stuff became tricky, quickly. The problem with auto-nesting is that the expected behaviour can become ambiguous:

section
    h1
        em
      span
    div
        p

Furthermore, you have to contend with spaces and tabs. Which one counts as a single level of indentation?

The solution I eventually rested on was simply letting the user mess stuff up themselves, if they wanted. The parser will count levels of indentation by how many whitespace characters you have. I’d like to add an error that’s thrown if the user’s silly enough to mix tabs and spaces. For now, though, they’ll have to suffer. There is an inherent ambiguity in this kind of magic. What should the parser do with this? —

body
  div
    p {
    span
  em
    }

Right now, we assume, because the user has opted to use curlies on the p element, that the auto-nesting should be turned off until the curly closes. Another option would be to reset the indentation counter to zero and try to resolve children regularly. But the above code is still ambiguous. Should an error be thrown? Maybe “SyntaxError: What on earth are you doing?”

Is it done? What is it?

Yeh, it’s done, more or less.

Technically, it’s an HTML preprocessor. It’s not a templating engine. It doesn’t do that. Reasons are as follows:

  1. Feature bloat
  2. People still write plain ol’ HTML
  3. Pure DOM templates are on the rise. See AngularJS or Knockout.

Also: client-side templating is a minefield of different approaches. I’ll stay out if I can.

SIML can cater to the DOM template style quite gracefully. This is using SIML’s Angular generator:

ul#todo-list > li
  @repeat( todo in todos | filter:statusFilter )
  @class({
    completed: todo.completed,
    editing: todo == editedTodo
  })

That produces:

<ul id="todo-list">
  <li
    ng-repeat="todo in todos | filter:statusFilter"
    ng-class="{ completed: todo.completed, editing: todo == editedTodo }"
  ></li>
</ul>

The @foo things you see above are directives. You can create your own in a new generator, if you so wish. The Angular generator, by default, will create ng- HTML attributes from undefined pseudo-classes and directives. So I could do:

div:cloak
  @show(items.length)

And that would generate:

<div ng-cloak ng-show="items.length"></div>

Ideas and paths

It’s early days and I’m not even sure if SIML provides enough value as-is, but I do think it could serve devs quite well for the following use-cases:

  • Creating boilerplate HTML code quickly
  • Creating cleaner AngularJS/Knockout markup (Example)
  • Creating bespoke directives/pseudo-classes/attributes to serve your needs

The last point is quite powerful, I think. Imagine having a bunch of pre-defined directives that would allow you to do stuff like:

#sidebar
  input
    @datepicker({
      start: [2013,01,01]
    })

Closing remarks

As a learning exercise it was very valuable. I hope, as a happy accident, I’ve created something potentially useful to others.

Permissive user input validation
https://j11y.io/javascript/permissive-user-input-validation/
Sun, 27 Jan 2013

A ux.stackexchange question prompted me to consider how one might implement a more permissive type of input validation. It’s not rare for a form to punish the user if they add an extra space before typing in a date, or accidentally use a comma instead of a period when typing in an IP address. After all, we employ strict validation to keep the data correct.

Garbage In — Garbage Out. It rings true, but maybe, taken too literally, it leads us to impose strict validation and a no-exceptions policy on our users. We punish a user typing ’12’ instead of the fully-qualified ‘2012’,… why? Either it’s our thoughtlessness or it’s the very unlikely (depending on context) possibility that the user did in fact mean the year ‘1912’ or ‘1812’ or ‘1012’…

If we start down the road of permissive input validation then we need to also explore input correction. We can’t allow a rogue comma to slip in and not correct it. It’s probably best to correct it straight away (not too soon — possibly on blur) so that the actual data stored conforms to the correct format.

William Hudson executed a date survey in 2009 to discover all the various ways American users like to enter dates. The results show that users use a variety of formats. It makes perfect sense to accept all these variants and let the computer figure out what is what.

For the specific problem of entering dates, I would like to recommend Date.js, because it can successfully parse most of those variants. However, there is a big caveat when it comes to dates, especially on international forms. The American style of entering a date, MM/DD/YY, is technically impossible to differentiate from the other standard of DD/MM/YY, unless the DD portion happens to be above 12. For this reason I guess it would be best to cater to your localized users as best as possible.

An alternative is to retain rigidity in your validation but allow for some minor mistakes. For example, insist upon the ISO format of YYYY-MM-DD but don’t make a fuss if the user separates with a slash or a space (or heck, anything) instead of a dash.

My point is: Maybe formal validation with permissive aspects mixed in gives us the best of both worlds. We don’t punish the user for minor mistakes, and we don’t end up with ambiguous data.

In an attempt to practice this technique of mixing rigidity with leniency, I created vic.js.

Currently validation in JavaScript can be quite an ugly affair, plagued with remnants of DHTML and overly invasive input masks. It’s not uncommon to see stuff like this:

someInput.onkeyup = function() {
  if (!this.value.match(/some rigid regex/)) {
    alert('Enter the right value, you fool');
  }
};

Typically the rules are strict, the characters non-negotiable, the regular expression unyielding, and the presented invalidation UI annoying.

vic.js (a.k.a Vic, VIC) allows you to define a lenient regular expression, and it expects you to extract your important data from the captured groups.

Vic’s signature goes something like this:

vic(
  LENIENT_PATTERN_WITH_CAPTURED_GROUPS,
  PER_GROUP_PROCESSOR,
  POST_PROCESSOR
);

The simple example would be a ‘year’ field:

var yearVic = vic(
  /^\s*(\d{1,4})\s*$/,
  function(year) {
    // Let's assume anything between 14 and 99 is from the 1900s:
    return vic.pad(year > 13 && year <= 99 ? '1900' : '2000' )(year);
  },
  Number // cast full output to a Number
);
 
yearVic('2012');   // => 2012
yearVic('01');     // => 2001
yearVic('hd2kd9'); // => false
yearVic('20021');  // => false
yearVic('96');     // => 1996
yearVic('  4');    // => 2004
yearVic('113');    // => 2113

The regex used for the year example, /^\s*(\d{1,4})\s*$/, is lenient in that it allows whitespace at the beginning and end, and doesn’t mind if the user enters one, two, three or four digits for the year. For years greater than 13 and less than 100 we assume the user is referring to the previous century, so we apply ‘1900’ as padding; otherwise we assume we should pad with ‘2000’.

Vic offers a couple of helpers for basic tasks like padding, applying lower/upper case, etc. I’ll probably be adding to these as I think of more common use-cases for vic.
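Inferring from the examples in this post, vic.pad overlays the entered value onto the right-hand end of a template string:

vic.pad('1900')('96'); // => '1996'
vic.pad('2000')('4');  // => '2004'
vic.pad('00')('7');    // => '07'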

Vic allows more atomized per-group processing too. In this example we’ll validate a date in the form YYYY-MM-DD, but we’ll allow any one of ./,:- (plus spaces) as separators, and we’ll validate the component numbers and pad them too:

var vicDate = vic(/^\s*(\d{1,4})[./,: -](\d{1,2})[./,: -](\d{1,2})\s*$/, {
    1: function(year) {
      // Year between 50 and 99 assumed to be '19YY', otherwise presumed after 2000
      return vic.pad(year >= 50 && year <= 99 ? '1900' : '2000' )(year);
    },
    2: function(month) {
      return month >= 1 && month <= 12 && vic.pad('00')(month);
    },
    3: function(day, i, all) {
      // Check that there are {day} amount of days in the entered month:
      return day > 0 &&
        day <= new Date(all[1], all[2], 0).getDate() &&
        vic.pad('00')(day);
    }
}, function(v) {
  return v.join('-');
});
 
vicDate('111');       // => false
vicDate('2/3/4/5');   // => false
vicDate('16.332.2');  // => false
vicDate('20  1  20'); // => false
vicDate(' 1999.7.0'); // => false
vicDate('1999.0.1');  // => false
 
vicDate('1999.9.32'); // => false (no 32 in Sept)
vicDate('1999.2.28'); // => '1999-02-28'
vicDate('1999.2.31'); // => false (no 31 in Feb)
 
vicDate('1.1.1');     // => '2001-01-01'
vicDate('1956.3.2');  // => '1956-03-02'
vicDate('16.03-2');   // => '2016-03-02'
vicDate(' 20 1 20 '); // => '2020-01-20'
vicDate('1999.7.31'); // => '1999-07-31'

What we’ve done above is execute a rigid validation of the data that’s important to us (YYYY, MM and DD) while letting the user mess with the non-important stuff to their heart’s content (whitespace & separators).

Vic is simple. It’s not a high level abstraction but it’s not complex. It’s a few lines of code.

The fact is: you could easily integrate this methodology into your own validation utilities. The basic principle is to extract the important data, validate it, but allow the user some flexibility in how they give you the important data.

Check out vic.js on Github.

Who maintains your JS?
https://j11y.io/javascript/who-maintains-your-js/
Sat, 05 Jan 2013

My last post discussed techniques I once saw as the height of cleverness but now deem foolish. The truth is: I’m still battling with these coding dilemmas every day… It seems to be a constant game of cleverness/terseness/speed vs. verbosity/readability.

We’re told to write code for the poor soul who has to maintain it in the future. What nobody tells you is how much knowledge this poor soul has, both about the language and the problem domain. People also say to expect this future maintainer to be an idiot — this will mean you write the most understandable code possible. But again: how much of an idiot is this person?

The fact is: we have a set of unspoken assumptions we tend to make about this mystery future maintainer.

The following are mostly assumptions related to JavaScript syntax. Unexplored assumptions are ‘Design patterns’, ‘OOP’, ‘The problem domain’ etc.

Level 1 maintainer

  • Sentient humanoid
  • Speaks/Writes English well
  • Knows what JavaScript is
  • Has programmed before

Level 2 maintainer

  • Knows the correct syntax for if, else, for, while, do, throw, function, var …
  • Knows about types
  • Knows all the types available in JS
  • Knows the difference between a statement and an expression

Level 3 maintainer

  • Knows about strict vs. non-strict equality operators
  • Knows about short-circuiting in logical operators
  • Knows the difference between Arrays and array-like Objects
  • Knows several ways to cast a value, and ways to implicitly force coercion
  • Knows regular expressions to an intermediate level (anchors, alternation, character classes)

Level 4 maintainer

  • Knows what a bitwise operator does
  • Knows about primitives vs. objects, and wrapped primitives for method calling
  • Knows what a closure is
  • Knows how properties are resolved via the prototype chain
  • Knows the differences between prefix and postfix increment operators
  • Knows about variable/function-declaration hoisting
  • Knows the exact difference between x==null and x===null
  • Knows the difference between function declarations and function expressions

Level 5 maintainer

  • Knows how to check types without using typeof or instanceof
  • Knows why new Number(1) != new Number(1) (see the snippet after this list)
  • Knows regular expressions well (lookarounds, capture vs. no-capture, greed)
  • Knows what named function expressions are
  • Knows the side effects of bitwise operations in JavaScript
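A couple of those level-5 items, made concrete:

new Number(1) == new Number(1); // => false: two distinct wrapper objects
typeof new Number(1);           // => 'object', not 'number'
1.5 | 0;                        // => 1: bitwise ops truncate to 32-bit ints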

Level 6 maintainer (Eich level)

  • Knows what values are returned by all JavaScript operators
  • Knows all precedence and associativity of all operators
  • Knows where ASI will put semi-colons
  • Knows the subtle differences between ES 3.1, 5.1 and 6th Edition draft

*** The above is not exhaustive and is very subjective. I don’t wish to compartmentalize understanding into levels — it’s misleading. I just wanted to somehow portray these assumptions in a clearly understandable way.

I think most devs simply do not consider a future maintainer that is any less capable than themselves, and perhaps this is a problem. If you mark yourself as a maintainer 4.5 then is it fair for you to write in a way that presumes that all future readers of your code are <= 4.5? Is this reasonable?

I suppose it depends on what the code does (e.g. high level MVC sugar vs. low-level 3D rendering) and the people that you’re currently working with, for it is from them that you are likely to draw your estimation of an appropriate maintainer level.

I’m interested to hear other peoples’ opinions on this.

What level do you assume in your future maintainers? How clever does your code get before you deem it reasonable to add a comment or opt for a more explicit and clearer technique?

JS adolescence
https://j11y.io/javascript/js-adolescence/
Mon, 12 Nov 2012

For me there was a time that can only be described as adolescence in the field of programming, and more specifically JavaScript. This period was characterised by a certain laziness and hubris. I thought I was right. I thought others were wrong. I could see no path but the very strict one that I had boxed myself into, through a process in which I longed for certainty and the expulsion of realistic doubt.

Today I present a short list of JavaScript practices that once seemed right but I now deem foolish (in most situations):

Using logical operators as statements

a() || b.foo.bar(); m && c() ? d() : l || m.foo[bee](3);

There is no real advantage in forgoing convention for the sake of cleverness.

Always declaring variables at the top of their scope

function calculate() {
  var a, m, t, y, This, is, really, understandable = false;
  // Actual logic goes here
}

Instead I declare a variable where it makes sense. Much of the time this is at the top of the scope, but there are enough exceptions to make me shake my head in disapproval at the thought of a style guide imposing religious top-of-scope declarations.

Repeatedly used inline object literals

function prepareRequest(mode, updateId, value) {
  var req = {
    mode: mode,
    updateId: updateId,
    value: value,
    uid: genUID()
  };
  // etc.
}

It’s better for clarity and performance (potentially) to define separate classes for cases where you repeatedly make object literals of the same structure. For example the above could be refactored into a Request class.
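For instance (a sketch; the Request name and the genUID helper are carried over from the snippet above):

function Request(mode, updateId, value) {
  this.mode = mode;
  this.updateId = updateId;
  this.value = value;
  this.uid = genUID();
}

function prepareRequest(mode, updateId, value) {
  var req = new Request(mode, updateId, value);
  // etc.
}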

Complex inline regular expressions

if (/r[ -~]?(?:m\d+|\d+[0-3]{2,})_m?$/.test(thing)) {
  // ... wat
}

It makes sense to cache these things. Plus naming it means someone might have a chance of understanding WTF the code does.

Using single letter variable names

They make no sense unless used in the various idiomatic loops.

Strict equality everywhere

if (typeof x === 'string' && x === 'mode:45') {
  // ... 
}

Sometimes it’s overkill. Regular equality can be used in both of the above cases.

Assuming truthiness equals presence

if (!cachedReturnValues[key]) {
  cachedReturnValues[key] = value;
}

It’s potentially dangerous. E.g. It’s better to use the ‘in’ operator to discover presence, or even ‘hasOwnProperty’.
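That is, something along these lines:

// A falsy-but-present value (0, '', null) no longer defeats the cache:
if (!(key in cachedReturnValues)) {
  cachedReturnValues[key] = value;
}

// Or, ignoring inherited properties such as 'toString':
if (!cachedReturnValues.hasOwnProperty(key)) {
  cachedReturnValues[key] = value;
}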

Commenting nothing or commenting every little thing

// This is a loop
for (var i = 0; i < arr.length; i++) {
  // This is me adding one to the item in the array at the index
  // specified by the variable `i` declared slightly above this comment:
  arr[i] += 1;
} // this is the end of the loop

Commenting everything demonstrates either a lacking clarity in your code or a lacking clarity in your mind as to the intention of comments. Commenting nothing shows hubris and laziness.

Thinking I can write something reliable without unit tests

Unit tests give me the confidence to be sure that my code does what I intend it to do.

Using long names and being overly specific

A long name suggests bad API design. Re-think where you’re defining the variable/method. Think of a circumstance in which the variable name could be shorter — a context that makes sense for that variable. Then create that context. You’ll probably end up with a cleaner internal API. E.g.

// Version 1:
var responseCache = {};
function generateNextIDForANewResponseCacheItem() {...}
function deleteResponseCacheItemsThatDoNotEqual(value) {...}
 
// Version 2:
function ResponseCache() {
  this.data = {};
}
ResponseCache.prototype = {
  genNextID: function() {...},
  remove: function(optionalFnCheck) {...}
};

Strict coherence to a standard

This never works and never solves what you think it will. It’s better to dynamically sway and constantly rethink your idea of what is right and wrong given each situation as it arises. This is why I try to steer clear of over-zealous coding style dogma, or all dogma for that matter.
