A few weeks ago I set about creating a new markup language. I wanted to learn more about language parsing, grammars, and the various difficulties involved.
I also had a very specific idea of what I wanted to create: a dead simple alternative to HTML. I’d recently picked up SASS and tried to draw on its succinctness to inspire me. CSS itself is quite succinct in how it declares elements, IDs, classes and attributes. And SASS, drawing on its own inspiration, HAML, adds the elegance of tabbed nesting.
I’d done something similar a while ago, allowing you to get DOM structures from basic CSS selectors:
ul li:5 span[innerHTML="item"]
Using satisfy() this becomes:
<ul>
  <li><span>item</span></li>
  <li><span>item</span></li>
  <li><span>item</span></li>
  <li><span>item</span></li>
  <li><span>item</span></li>
</ul>
But I didn’t want to stop there; I wanted to create a way to define entire HTML documents with minimal syntax, i.e. allowing you to write stuff like:
html
  head
    title 'something'
  body
    h1
      a[href=/] 'something'
Creating the parser
I began by looking into PEGjs, a really impressive parser generator for JavaScript. It allows you to specify the rules of your grammar like so:
Single
  = Attribute
  / Element
  / Text
  / Directive

//...

Attribute
  = name:AttributeName _ ":" _ value:Value (_ ";")? {
    // This bit is just regular JavaScript...
    return ['Attribute', [name, value]];
  }
The grammar above specifies the rule Single, which lists the valid “Single” alternatives, such as Attribute, which is also specified above. The Attribute rule references AttributeName:
AttributeName
  = name:[A-Za-z0-9-_]+ { return name.join(''); }
  / String
An AttributeName can be a string of characters matching the pattern [A-Za-z0-9-_]+ or a String (wrapped in quotes), which is also specified in the grammar.
It’s seemingly dead-simple, although there are gotchas like left recursion and poisonously inefficient backtracking. At one point it was taking my parser 700ms to parse this:
a { b { c {} } }
I found that I was writing rules in a way that caused a lot of backtracking. That is, when the parser tried a rule and failed, it would go back to the starting character and try the next alternative. In a nutshell, don’t do this:
SomeRule
  = [a-zA-Z]+ '::' [0-9]+ ';'
  / [a-zA-Z]+ '::' [0-9]+
Instead, just make the semi-colon optional:
SomeRule = [a-zA-Z]+ '::' [0-9]+ ';'?
This may seem trivial, but it’s not always easy to spot in higher-level rules. Small optimisations like this matter.
I was able to get that ridiculous 700ms down to 5ms! And there are still improvements to be made.
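To make the cost of the two-alternative form concrete, here is a rough hand-rolled sketch. It is not what PEG.js generates; the function names and the `stats` counter are made up for illustration. It counts how often the shared prefix `[a-zA-Z]+ '::' [0-9]+` has to be matched under each version of the rule.

```javascript
// Hypothetical illustration only; these helpers are not part of PEG.js.

// Matches [a-zA-Z]+ '::' [0-9]+ at the start of `input`.
// Returns the end position, or -1 on failure. Each call is counted.
function matchPrefix(input, stats) {
  stats.prefixRuns++;
  const m = /^[a-zA-Z]+::[0-9]+/.exec(input);
  return m ? m[0].length : -1;
}

// SomeRule = prefix ';' / prefix   (two alternatives sharing a prefix)
function matchWithAlternatives(input, stats) {
  const end = matchPrefix(input, stats);
  if (end !== -1 && input[end] === ';') return end + 1;
  // First alternative failed: go back to position 0 and redo the prefix.
  return matchPrefix(input, stats);
}

// SomeRule = prefix ';'?   (optional semicolon, nothing to redo)
function matchWithOptional(input, stats) {
  const end = matchPrefix(input, stats);
  if (end === -1) return -1;
  return input[end] === ';' ? end + 1 : end;
}

const withAlternatives = { prefixRuns: 0 };
const withOptional = { prefixRuns: 0 };
matchWithAlternatives('abc::123', withAlternatives);
matchWithOptional('abc::123', withOptional);
console.log(withAlternatives.prefixRuns, withOptional.prefixRuns); // 2 1
```

One redundant pass looks harmless here, but when the shared prefix is itself a nested rule with the same problem, the repeated work multiplies at every level of nesting.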
Creating the generator
The generator would have to be able to take output from the parser and generate HTML from it. From a string like a b c the parser outputs a structure like this:
[
  "Element",
  [
    [ [ "Tag", "a" ] ],
    [
      [
        "Element",
        [
          [ [ "Tag", "b" ] ],
          [
            [ "Element", [ [ [ "Tag", "c" ] ] ] ]
          ]
        ]
      ]
    ]
  ]
]
The HTML generation was quite simple to do. Essentially, I treated every Element as an entity that can have children. An Element’s children could be other Elements, Attributes, Text or even custom directives. So, this:
label { label: foo; input#foo }
Would parse to:
[
  "Element",
  [
    [ [ "Tag", "label" ] ],
    [
      [
        "IncGroup",
        [
          [ "Attribute", [ "label", "foo" ] ],
          [ "Element", [ [ [ "Tag", "input" ], [ "Id", "foo" ] ] ] ]
        ]
      ]
    ]
  ]
]
Essentially, the hierarchy that you originally write is reflected in the tree output by the parser. The generator can then just recurse through this structure, creating HTML strings as it goes along.
For example, this is the default generator for HTML attributes:
//...
_default: {
  type: 'ATTR',
  make: function(attrName, value) {
    if (value == null) {
      return attrName;
    }
    return attrName + '="' + escapeHTML(value) + '"';
  }
},
//...
This would make `for:foo;` output the HTML, `for="foo"`.
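The recursion described above could be sketched roughly like this. It is a simplified illustration, not SIML's actual generator: the `generate` function, the `escapeHTML` body, the `div` fallback and the way IncGroup is flattened are all assumptions, and only the Tag, Id, Element, IncGroup and Attribute node types from the trees above are handled.

```javascript
// Hypothetical sketch; names and behaviour are assumptions, not SIML's real code.
function escapeHTML(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function generate(node) {
  const [type, body] = node;
  if (type !== 'Element') {
    throw new Error('Unsupported node type in this sketch: ' + type);
  }
  // An Element body is [selectorParts] or [selectorParts, children].
  const [selectorParts, children = []] = body;

  let tag = 'div'; // assumed fallback when the selector has no Tag part
  const attrs = [];
  for (const [partType, partValue] of selectorParts) {
    if (partType === 'Tag') tag = partValue;
    else if (partType === 'Id') attrs.push('id="' + escapeHTML(partValue) + '"');
  }

  let inner = '';
  const queue = [...children];
  while (queue.length > 0) {
    const child = queue.shift();
    const [childType, childBody] = child;
    if (childType === 'IncGroup') {
      // Treat the group's contents as direct children, in order.
      queue.unshift(...childBody);
    } else if (childType === 'Attribute') {
      const [name, value] = childBody;
      attrs.push(name + '="' + escapeHTML(value) + '"');
    } else {
      inner += generate(child); // recurse into nested Elements
    }
  }

  const open = attrs.length > 0 ? tag + ' ' + attrs.join(' ') : tag;
  return '<' + open + '>' + inner + '</' + tag + '>';
}

// The tree the parser produces for "a b c":
const tree = ['Element', [
  [['Tag', 'a']],
  [['Element', [
    [['Tag', 'b']],
    [['Element', [[['Tag', 'c']]]]]
  ]]]
]];
console.log(generate(tree)); // <a><b><c></c></b></a>
```

Note that this sketch always emits a closing tag, so void elements such as input would need special handling in anything real.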
Fun feature: Exclusives
The fake power you feel when creating a language frequently manifests in strange features and syntax. That’s what happened here. Although I do genuinely feel that this particular one is useful.
I’m talking about “Exclusive Groups”. When writing your CSS-style selectors, they allow you to specify alternatives within parentheses, which are then expanded so that the resulting HTML covers all the potential combinations. An example:
x (a/b) // expands to: "x a, x b"
That would give you:
<x>
  <a></a>
</x>
<x>
  <b></b>
</x>
A more complex example:
(a/b) (x/y)
That would give you:
<a><x></x></a>
<a><y></y></a>
<b><x></x></b>
<b><y></y></b>
The original selector (a/b)(x/y) expanded to a x, a y, b x, b y.
A little nifty, a little pointless.. perhaps. Although it can be useful:
ul li ('A'/'List'/'Of'/'Stuff')
(becomes)
<ul>
  <li>A</li>
  <li>List</li>
  <li>Of</li>
  <li>Stuff</li>
</ul>
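The expansion of exclusive groups is essentially a Cartesian product over the alternatives. Here is a minimal sketch of that idea, assuming the selector has already been split into an array of groups (a plain part like x is just a group with one alternative). The `expandExclusives` name is hypothetical and this is not how SIML itself does it.

```javascript
// Hypothetical helper: each entry in `groups` is an array of alternatives.
function expandExclusives(groups) {
  return groups
    .reduce(
      // For every combination built so far, append each alternative in turn.
      (combos, alternatives) =>
        combos.flatMap(combo => alternatives.map(alt => [...combo, alt])),
      [[]]
    )
    .map(parts => parts.join(' '));
}

console.log(expandExclusives([['x'], ['a', 'b']]));
// [ 'x a', 'x b' ]
console.log(expandExclusives([['a', 'b'], ['x', 'y']]));
// [ 'a x', 'a y', 'b x', 'b y' ]
```

Each resulting selector can then be handed to the generator on its own, which is why (a/b) (x/y) yields four separate elements.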
Indentation
I wanted there to be the option to use traditional CSS curlies to demarcate nestings. I.e.
div {
  ul {
    li {
      //...
    }
  }
}
But I also wanted auto-nesting via indentation, like in SASS:
div
  ul
    li
      //...
Stuff became tricky, quickly. The problem with auto-nesting is that the expected behaviour can become ambiguous:
section h1 em span div p
Furthermore, you have to contend with spaces and tabs. Which one counts as a single level of indentation?
The solution I eventually settled on was simply letting the user mess stuff up themselves, if they wanted. The parser will count levels of indentation by how many whitespace characters you have. I’d like to add an error that’s thrown if the user’s silly enough to mix tabs and spaces. For now, though, they’ll have to suffer. There is an inherent ambiguity in this kind of magic. What should the parser do with this?
body div p { span em }
Right now, we assume, because the user has opted to use curlies on the p element, that the auto-nesting should be turned off until the curly closes. Another option would be to reset the indentation counter to zero and try to resolve children regularly. But the above code is still ambiguous. Should an error be thrown? Maybe “SyntaxError: What on earth are you doing?”
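Counting whitespace characters to decide nesting can be sketched with a simple stack. This is an illustration under my own assumptions, not SIML's actual parser: `nestByIndentation` and the node shape are made up, curlies are ignored entirely, and every leading space or tab counts as one unit.

```javascript
// Hypothetical sketch of indentation-based nesting; not SIML's real parser.
function nestByIndentation(source) {
  // A root with indent -1 sits below every real line, so it is never popped.
  const root = { text: null, indent: -1, children: [] };
  const stack = [root];

  for (const line of source.split('\n')) {
    if (line.trim() === '') continue; // skip blank lines

    // Every leading space or tab counts as one unit of indentation.
    const indent = line.match(/^[ \t]*/)[0].length;
    const node = { text: line.trim(), indent, children: [] };

    // Pop until the top of the stack is less indented than this line.
    while (stack[stack.length - 1].indent >= indent) stack.pop();

    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

const nested = nestByIndentation('div\n  ul\n    li\n  p');
console.log(JSON.stringify(nested, ['text', 'children'], 2));
```

Because a tab counts the same as a single space here, mixing the two produces nesting that looks wrong in most editors, which is exactly the foot-gun described above.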
Is it done? What is it?
Yeh, it’s done, more or less.
Technically, it’s an HTML preprocessor. It’s not a templating engine. It doesn’t do that. Reasons are as follows:
- Feature bloat
- People still write plain ol’ HTML
- Pure DOM templates are on the rise. See AngularJS or Knockout.
Also: client-side templating is a minefield of different approaches. I’ll stay out if I can.
SIML can cater to the DOM template style quite gracefully. This is using SIML’s Angular generator:
ul#todo-list > li
  @repeat( todo in todos | filter:statusFilter )
  @class({
    completed: todo.completed,
    editing: todo == editedTodo
  })
That produces:
<ul id="todo-list">
  <li
    ng-repeat="todo in todos | filter:statusFilter"
    ng-class="{ completed: todo.completed, editing: todo == editedTodo }"
  ></li>
</ul>
The @foo things you see above are directives. You can create your own in a new generator, if you so wish. The Angular generator, by default, will create ng- HTML attributes from undefined pseudo-classes and directives. So I could do:
div:cloak @show(items.length)
And that would generate:
<div ng-cloak ng-show="items.length"></div>
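As a rough idea of what that fallback could look like, here is a tiny sketch that turns an unknown pseudo-class or directive name into an ng- attribute. The `toNgAttribute` function is hypothetical and is not taken from SIML's Angular generator; it only mirrors the shape of the default attribute generator shown earlier.

```javascript
// Hypothetical sketch; not SIML's actual Angular generator.
function toNgAttribute(name, value) {
  const attrName = 'ng-' + name;
  if (value == null) {
    return attrName; // e.g. :cloak has no value, so emit a bare attribute
  }
  // Minimal escaping so the value cannot break out of the attribute.
  const escaped = String(value).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  return attrName + '="' + escaped + '"';
}

const attrs = [
  toNgAttribute('cloak'),               // from the pseudo-class :cloak
  toNgAttribute('show', 'items.length') // from the directive @show(items.length)
];
console.log('<div ' + attrs.join(' ') + '></div>');
// <div ng-cloak ng-show="items.length"></div>
```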
Ideas and paths
It’s early days and I’m not even sure if SIML provides enough value as-is, but I do think it could serve devs quite well for the following use-cases:
- Creating boilerplate HTML code quickly
- Creating cleaner AngularJS/Knockout markup (Example)
- Creating bespoke directives/pseudo-classes/attributes to serve your needs
The last point is quite powerful, I think. Imagine having a bunch of pre-defined directives that would allow you to do stuff like:
#sidebar input @datepicker({ start: [2013,01,01] })
Closing remarks
As a learning exercise it was very valuable. I hope, as a happy accident, I’ve created something potentially useful to others.
Thanks for reading! Please share your thoughts with me on Twitter. Have a great day!
Looks promising! Currently I’m using jade (http://jade-lang.com/) for that purpose.. Do you know it?
@Lars, Yeh I found Jade a few days after starting work on this — and I do like it. It’s concise and expressive. I did find some of the notation to be quite alien to me though (although thankfully less alien than HAML). What I originally wanted for SIML was something that anyone familiar with CSS (and maybe SASS) could pick up quite quickly.
It’s a quite entertaining/educational experience, even though there are some good options out there, like the already mentioned jade, and also zen coding, which has been out there for a while and now it’s even available within Visual Studio 2012 with its latest official update. The latter allows stuff like this:
div#myId>select>option[value=someValue]*5>lorem
generate:
Lorem ipsum dolor sit amet, consectetur adipiscing elit fusce vel sapien elit in malesuada semper mi, id sollicitudin urna fermentum ut fusce varius nisl ac ipsum gravida vel pretium tellus.
Tincidunt integer eu augue augue nunc elit dolor, luctus placerat scelerisque euismod, iaculis eu lacus nunc mi elit, vehicula ut laoreet ac, aliquam sit amet justo nunc tempor, metus vel.
Placerat suscipit, orci nisl iaculis eros, a tincidunt nisi odio eget lorem nulla condimentum tempor mattis ut vitae feugiat augue cras ut metus a risus iaculis scelerisque eu ac ante.
Fusce non varius purus aenean nec magna felis fusce vestibulum velit mollis odio sollicitudin lacinia aliquam posuere, sapien elementum lobortis tincidunt, turpis dui ornare nisl, sollicitudin interdum turpis nunc eget.
Sem nulla eu ultricies orci praesent id augue nec lorem pretium congue sit amet ac nunc fusce iaculis lorem eu diam hendrerit at mattis purus dignissim vivamus mauris tellus, fringilla.
And this is just the tip of the iceberg
@Ricardo, Yup, Zen coding’s pretty cool. Isn’t it now called Emmet though? – http://docs.emmet.io/
The HTML I put in my comment was obliterated and I couldn’t edit my comment, two good features you could add here 🙂
No, in this case it’s really called zen coding, at least in Visual Studio that’s what they’re calling it, and since on that one I don’t see VS in the supported list, I don’t think it’s the same, even though it looks very similar!
Welcome to the wonderful world of parsing. Hope you have a pleasant stay 😉 Next stop: writing the parser yourself.
😀
@peter — I hope I’m brave enough! I think that’s definitely the next challenge.
This is brilliant and I can’t wait for it to wholesale replace the manual writing of HTML.