Removing comments in JavaScript

For the recently developed debug.js I had to come up with a way to remove all comments from any piece of JavaScript code.

I originally thought that this would be a piece of cake; a simple regex takes care of everything!

code.replace(/\/\*.+?\*\/|\/\/.*(?=[\n\r])/g, '');

This regular expression would have worked in 90% of situations but, unfortunately, I had to build something that would work in every single situation.

It’s worth mentioning exactly when the above regular expression would fail:

When comment notation exists in a string, e.g.

var str = " /* not a real comment */ ";

When comment notation exists in a literal regular expression, e.g.

var regex = /\/*.*/;

When conditional compilation (supported in IE > 4) exists in the code, e.g.

/*@cc_on @*/
/*@if (@_jscript_version == 4)
alert("JavaScript version 4");
@else @*/
alert("Blah blah blah");
/*@end @*/

While the likelihood of any of the above happening is low, it’s certainly worth catering to all potential situations, just in case one of them arises!
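As a quick illustration of the first case, here is a small sketch (the sample input and the variable names are made up for this example) that runs the naive pattern over a line with comment notation inside a string, followed by a real comment:

```javascript
// The naive pattern, with its slashes, asterisks and newlines escaped.
var naive = /\/\*.+?\*\/|\/\/.*(?=[\n\r])/g;

// Comment notation inside a string literal, followed by a genuine line comment.
var source = 'var str = " /* not a real comment */ "; // trailing note\n';

var stripped = source.replace(naive, '');

// The genuine comment is removed, but so is the text inside the string.
console.log(JSON.stringify(stripped));
```

The regex has no notion of being "inside a string", so it treats both matches the same way.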

So, after a bit of googling and messing around, it turns out that the only way of doing this properly is to loop through the code, character by character, checking for certain delimiters and then enabling/disabling modes as the loop progresses:

/* 
    This function is loosely based on the one found here:
    http://www.weanswer.it/blog/optimize-css-javascript-remove-comments-php/
*/
function removeComments(str) {
    str = ('__' + str + '__').split('');
    var mode = {
        singleQuote: false,
        doubleQuote: false,
        regex: false,
        blockComment: false,
        lineComment: false,
        condComp: false 
    };
    for (var i = 0, l = str.length; i < l; i++) {
 
        if (mode.regex) {
            if (str[i] === '/' && str[i-1] !== '\\') {
                mode.regex = false;
            }
            continue;
        }
 
        if (mode.singleQuote) {
            if (str[i] === "'" && str[i-1] !== '\\') {
                mode.singleQuote = false;
            }
            continue;
        }
 
        if (mode.doubleQuote) {
            if (str[i] === '"' && str[i-1] !== '\\') {
                mode.doubleQuote = false;
            }
            continue;
        }
 
        if (mode.blockComment) {
            if (str[i] === '*' && str[i+1] === '/') {
                str[i+1] = '';
                mode.blockComment = false;
            }
            str[i] = '';
            continue;
        }
 
        if (mode.lineComment) {
            if (str[i+1] === '\n' || str[i+1] === '\r') {
                mode.lineComment = false;
            }
            str[i] = '';
            continue;
        }
 
        if (mode.condComp) {
            if (str[i-2] === '@' && str[i-1] === '*' && str[i] === '/') {
                mode.condComp = false;
            }
            continue;
        }
 
        mode.doubleQuote = str[i] === '"';
        mode.singleQuote = str[i] === "'";
 
        if (str[i] === '/') {
 
            if (str[i+1] === '*' && str[i+2] === '@') {
                mode.condComp = true;
                continue;
            }
            if (str[i+1] === '*') {
                str[i] = '';
                mode.blockComment = true;
                continue;
            }
            if (str[i+1] === '/') {
                str[i] = '';
                mode.lineComment = true;
                continue;
            }
            mode.regex = true;
 
        }
 
    }
    return str.join('').slice(2, -2);
}

The best way to wrap your head round the above code is to literally take it step by step. There are six modes, and at most one of them is set to true at any time during iteration; the active mode represents the construct currently being looped through (a string, a regular expression, a comment etc.). The modes are:

mode.singleQuote: Single-quote delimited string ('string').
mode.doubleQuote: Double-quote delimited string ("string").
mode.regex: Literal regular expression (/regex/).
mode.blockComment: Block comment (/*...*/).
mode.lineComment: Line comment (//...).
mode.condComp: Conditional compilation (/*@...@*/).
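Since no two modes are ever active together, the same idea can be pictured with a single mode variable. The following is only a simplified sketch, not the function above: `stripBlockComments` is a hypothetical name, it knows about double-quoted strings and block comments only, and it skips the character after a backslash instead of looking one character behind.

```javascript
// Hypothetical, cut-down scanner: one `mode` variable instead of six flags.
// Handles only double-quoted strings and block comments.
function stripBlockComments(code) {
    var mode = 'code'; // 'code' | 'doubleQuote' | 'blockComment'
    var out = '';
    for (var i = 0; i < code.length; i++) {
        var ch = code.charAt(i);
        var next = code.charAt(i + 1);

        if (mode === 'doubleQuote') {
            out += ch;
            if (ch === '\\') {
                // Copy the escaped character as well, so \" cannot end the string.
                out += next;
                i++;
            } else if (ch === '"') {
                mode = 'code';
            }
            continue;
        }

        if (mode === 'blockComment') {
            if (ch === '*' && next === '/') {
                i++; // skip the closing slash too
                mode = 'code';
            }
            continue; // nothing inside a comment is copied
        }

        if (ch === '"') {
            mode = 'doubleQuote';
        } else if (ch === '/' && next === '*') {
            i++; // skip the asterisk so "/*/" is not read as a complete comment
            mode = 'blockComment';
            continue;
        }
        out += ch;
    }
    return out;
}

console.log(stripBlockComments('"a\\"" /*Boo!*/'));
```

Each branch either copies the character or drops it, then decides whether the mode changes; the full function does the same thing with more modes.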

Here’s an example trail through the loop:

Using string ->   "a\"" /*Boo!*/
 
01. Double quote; *mode.doubleQuote* activated.
02. Letter 'a'; loop continues.
03. Character '\'; loop continues.
04. Double quote; ignored because the previous character is an escaper.
05. Double quote; last character is not '\'; so *mode.doubleQuote* de-activated
06. Space; loop continues.
07. Character '/'; Next character is asterisk; *mode.blockComment* activated
    - character replaced with an empty string
08. Letter 'B'; loop continues.
    - character replaced with an empty string
09. Letter 'o'; loop continues.
    - character replaced with an empty string
10. Letter 'o'; loop continues.
    - character replaced with an empty string
11. Character '!'; loop continues.
    - character replaced with an empty string
12. Character '*' followed by '/'; *mode.blockComment* de-activated
    - both characters replaced with an empty string
 
Result ->   "a\""

There’s quite a lot of forward/back-tracking involved, which is why a couple of arbitrary characters are added to either end of the string before the loop: to make sure something is there when str[i-2] is queried.
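A rough illustration of what the padding changes (the sample input is made up for this example): compare what a two-character look-behind sees at the very start of the input with and without it.

```javascript
var input = '/*@cc_on @*/';

var bare = input.split('');
var padded = ('__' + input + '__').split('');

// Without padding, looking two characters behind the first one falls off the array.
console.log(bare[0 - 2]);   // undefined

// With padding, the first real character sits at index 2, so the same look-behind is defined.
console.log(padded[2 - 2]); // "_"

// Slicing two characters off each end afterwards restores the original text.
var restored = padded.join('').slice(2, -2);
console.log(restored === input); // true
```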

Note: the code I used in the removeComments function could be shortened; in fact, the entire function could probably be squeezed into 20 lines but that would only slow it down. Terseness does not always equal speed, especially so in this situation; a somewhat repetitive stream of IF statements really is the only way to produce acceptable performance.

I’d love to be proven wrong in this situation so if anyone can come up with an easier way of doing this I’d love to hear it! Especially if you think you can solve this with regular expressions alone!

Thanks for reading! Please share your thoughts with me on Twitter. Have a great day!

Lea Verou May 25th, 2009 at 1:23 am

I think you’ll come across similar problems when writing any kind of parser (and your script is actually a parser for JavaScript comments).

I’ve stumbled across it several times in the past, the last two while writing a syntax highlighter and the latest while writing a small parser for google-style search queries. In the first case (the syntax highlighter), I decided it’s not worth the extra resources for such edge cases, since it was mainly for my personal use and I wanted something fast, even if I had to sacrifice 100% correctness.

In the second, more recent case, I decided to take a similar approach as you did, since such cases were going to be really common and it’s also a commercial project, so I can’t have search queries failing due to lazy parsing.

I’m also really interested if there’s a better solution. I think perhaps it would be somehow possible by combining regex lookahead and lookbehind (in languages that support them, JS doesn’t support lookbehind 🙁 ) but I’m not very experienced with those two.

Rick May 25th, 2009 at 11:14 am

This looks very handy indeed. Very useful for CSS and Javascript optimisers among other things.

I may be wrong, but it looks like it could still be tripped up with double backslashes… E.g. "a\\"" /*Boo!*/

Vasco Fernandes May 25th, 2009 at 2:06 pm

Actually, one of the first things you’ll learn in a compilers class is that no regular expression can take care of the comments problem; you need a full-blown parser to do that. So I don’t think you can do it any simpler way than going through every character. You can, however, if you’re willing to do it in C/C++ or Java, use something like lex to do it with only a few lines of code.

James May 25th, 2009 at 6:02 pm

Thanks for the comments!

@Lea, I thought about whether or not it’d be achievable using regex lookaheads/behinds but I figured, even if JavaScript supported lookbehinds, it would be quite slow.

@Rick, you’re right; it would be tripped up by that, but it’s not valid JavaScript so I’m not bothered. "\\"" would throw a syntax error (since the first backslash is escaping the second, there is nothing left to escape the second ").

@Vasco, It seems so; I read more about this technique over here: http://www.codeproject.com/KB/cs/jscompress.aspx – they seem to be using a similar method to remove comments. It’s funny how something that seems so simple can end up being quite complicated…

adam July 23rd, 2009 at 9:29 pm

great thanks for that piece of code 😀 it helped me very much!

JavaScript comment removal – revisted – James Padolsey September 11th, 2009 at 8:25 pm

[…] while ago I posted a method I had been using at the time to remove comments from JavaScript code. It was pretty decent – […]

So far there's been 6 Responses to “Removing comments in JavaScript”
