JavaScript comment removal – revisited

A while ago I posted a method I had been using at the time to remove comments from JavaScript code. It was pretty decent – instead of using a regular expression it stepped through each character and removed comments where it found them.

At the time I thought stepping through a string character-by-character was the only reliable way to solve the “comments problem”, but after giving it another attempt I found that it was possible with only a few regular expressions and a fairly moderate dose of JavaScript’s replace() function.

UPDATE: Be cautious about using this — it’s come to my attention there are still some issues with it. Maybe, for now, trust the parsers over this 🙂

Here it is:

function removeComments(str) {
 
    var uid = '_' + +new Date(),
        primatives = [],
        primIndex = 0;
 
    return (
        str
        /* Remove strings */
        .replace(/(['"])(\\\1|.)+?\1/g, function(match){
            primatives[primIndex] = match;
            return (uid + '') + primIndex++;
        })
 
        /* Remove Regexes */
        .replace(/([^\/])(\/(?!\*|\/)(\\\/|.)+?\/[gim]{0,3})/g, function(match, $1, $2){
            primatives[primIndex] = $2;
            return $1 + (uid + '') + primIndex++;
        })
 
        /*
        - Remove single-line comments that contain would-be multi-line delimiters
            E.g. // Comment /* <--
        - Remove multi-line comments that contain would be single-line delimiters
            E.g. /* // <-- 
       */
        .replace(/\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g, '')
 
        /*
        Remove single and multi-line comments,
        no consideration of inner-contents
       */
        .replace(/\/\/.+?(?=\n|\r|$)|\/\*[\s\S]+?\*\//g, '')
 
        /*
        Remove multi-line comments that have a replaced ending (string/regex)
        Greedy, so no inner strings/regexes will stop it.
       */
        .replace(RegExp('\\/\\*[\\s\\S]+' + uid + '\\d+', 'g'), '')
 
        /* Bring back strings & regexes */
        .replace(RegExp(uid + '(\\d+)', 'g'), function(match, n){
            return primatives[n];
        })
    );
 
}

Theoretically this should work perfectly in almost all situations. Don’t bother even trying it with E4X as that definitely won’t work! E.g.

var someE4X = <box>// this is NOT a comment</box>;

It’s impossible to cater to E4X with regular expressions because XML is a recursive structure. I’m not bothered though as E4X isn’t exactly a widely used extension. It also doesn’t play well with conditional compilation but frankly, conditional compilation shouldn’t exist anyway.

Anyway, back to the solution. It takes a pretty conventional approach: swap all strings and regular expressions for placeholders first, then move on to the comments, then put the strings and regexes back. Unfortunately comments are not as simple as /\*.+?\*/ – comment delimiters can turn up inside strings, inside regular-expression literals and inside other comments.
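To make the stash-and-restore idea concrete, here is a minimal sketch that handles string literals only. The names stashStrings and restoreStrings and the placeholder format are made up for illustration and are not part of the function above; the string pattern is also a simplified one, so treat this as a sketch under those assumptions rather than a drop-in replacement.

```javascript
// Sketch only: hide string literals behind placeholders, strip comments,
// then restore the strings. All names here are illustrative.
function stashStrings(source) {
    var token = '_' + (+new Date()) + '_', // assumed not to occur in the input
        stash = [];

    // Match '...' or "..." including backslash-escaped characters.
    var stripped = source.replace(/(['"])(?:\\[\s\S]|(?!\1)[^\\\n])*\1/g, function (match) {
        stash.push(match);
        return token + (stash.length - 1) + '_';
    });

    return { text: stripped, token: token, stash: stash };
}

function restoreStrings(state, text) {
    // Swap each placeholder back for the literal it replaced.
    return text.replace(RegExp(state.token + '(\\d+)_', 'g'), function (match, n) {
        return state.stash[n];
    });
}

var input = 'var s = "// not a comment"; // a comment';
var state = stashStrings(input);
var cleaned = state.text.replace(/\/\/[^\n\r]*/g, ''); // naive single-line pass
var output = restoreStrings(state, cleaned);
// output: 'var s = "// not a comment"; '
```

Because the string is hidden before the comment pass runs, the // inside it is never seen as a comment start.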

The only part of my solution which might need further explanation is the following expression:

/\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g

This regex searches for single-line or multi-line comments that have the starting delimiter of the other type of comment within. For example:

/*
    This is a multi-line comment
    // Still a multi-line comment
*/
 
// This is a single-line comment /* ...still a single-line comment

These two situations need to be catered to so that the subsequently executed regular expression doesn’t remove the wrong parts of such comments.
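As a quick illustration of what that expression picks up, here is a small sketch that runs it over both situations. The sample input and variable names are invented for the example; the pattern is the one from the function, with the end-of-input alternative included in the lookahead.

```javascript
// The "mixed delimiter" pattern: single-line comments containing /* or *,
// and multi-line comments containing //.
var mixed = /\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g;

var sample = [
    '/*',
    '    This is a multi-line comment',
    '    // Still a multi-line comment',
    '*/',
    'var a = 1; // single-line /* ...still single-line',
    'var b = 2;'
].join('\n');

// Both comments are removed whole; the code around them is untouched.
var result = sample.replace(mixed, '');
// result: '\nvar a = 1; \nvar b = 2;'
```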

I’ve tested it with the following JavaScript and it works perfectly:

'string\' // still a string'; // comment /* not-a-nested-comment
/regex/; // comment */* still-a-comment
' /**/ string ' /* "comment..."
// still-a-comment */ alert('This isn\'t a comment!');
/\/* this isn't a comment! *\/; //* comment
/*
    //a comment... // still-a-comment
    12345
    "Foo /bar/ \""
*/
/*//Boo*/
/*/**/

Try the demo!


Corey Worrell September 12th, 2009 at 5:02 am

Hmm.. it seems to work pretty well.
These kinds of regex replacement things are rarely foolproof though of course. You did a nice job though.

But I threw this at it and it failed, haha. Notepad++ got it right though.

var one = 'testing this "stuff"\' now.'; // this is a comment /* hola */ // lol
 
/* This is a multiline comment
// still
/* still * * * * / */
something here

Nicolaj Kirkgaard Nielsen September 12th, 2009 at 7:01 am

Hi James…

Nice script. Lord knows I couldn’t have come up with anything in that league.

It seems that if the last line is a single line comment and the line doesn’t have a break at the end, the script doesn’t recognize the line as a comment.

var ok = "not really"; 
// This is a comment
// Shouldn't this be a comment too?

Doesn’t recognize the last line…

James September 12th, 2009 at 8:49 am

@Corey, Ah.. thanks. Fixed now! 🙂

@Nicolaj, fixed too, (forgot to add $ to regex)
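For anyone curious what the missing $ changes, here is a small sketch. The variable names and sample source are just for illustration; the two patterns are the single-line half of the comment regex, without and with the end-of-input alternative.

```javascript
// Without $ the lookahead needs a line break after the comment,
// so a comment on the very last line (no trailing newline) never matches.
var withoutEnd = /\/\/.+?(?=\n|\r)/g;
var withEnd = /\/\/.+?(?=\n|\r|$)/g;

var src = 'var ok = 1;\n// first\n// last line, no newline';

var a = src.replace(withoutEnd, ''); // last comment survives
var b = src.replace(withEnd, '');    // both comments removed
```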

Piotr Wasilewski September 12th, 2009 at 12:44 pm

Hi,

I’ve noticed that you often use this expression: 'some_string' + +new Date().
It’s good practice to add brackets, 'some_string' + (+new Date()), to avoid errors after minifying code.

James September 12th, 2009 at 7:37 pm

Just been testing it a little more and have found some more bugs. I guess it’s true what they say; the comment problem can’t be solved with regular expressions.

@Piotr, any minifying program worth its salt won’t screw up that line.

tomh October 15th, 2009 at 11:02 am

// comment
/* comment */ program //comment

Still not perfect, sorry.
This source doesn’t print “program” as expected, but “/comment”.
But nice try!

Jason P October 23rd, 2009 at 1:56 am

try this one to capture both SingleLine and MultiLine Comments:

/(\/\*[\u0000-\uFFFF]*?(?=\*\/)\*\/|\/\/[^\u000A|\u000D|\u2028|\u2029]*)/

Works for all of the examples in the comments

I am sure that [\u0000-\uFFFF] is probably too broad but I could not find a good example of what a unicode “code point” was according to ecmascript spec.

Alberto Lepe November 4th, 2009 at 6:28 am

I searched in many places but none of those codes worked for me. I wanted to remove starting multiline comments, and your regex worked without problem! (I will give you creds inside my code). Thank you!

James Rourke November 16th, 2009 at 12:42 am

Incidentally, you’ve misspelled “primitives”. 😉

Zoltan Hawryluk March 5th, 2010 at 3:56 pm

Hello there.

Thanks for the post. You saved me a lot of time – I needed something like this to parse the comments out of a CSS style sheet. Excellent research!

caii April 2nd, 2010 at 8:11 am

“the comment problem can’t be solved with regular expressions. ”

Maybe it can be solved with regular expressions:

function removeComments(str){
    // reg = /(comments)|(string)|(regexp)/g  (the three groups are placeholders for real sub-patterns)
    // When Arg_comments is not an empty string, a comment was found: replace it with ''.
    // Otherwise a string or regex was matched (like '/***/' or "//.." or /./**/ ...),
    // so return the match itself.
    return str.replace(reg, function(n, Arg_comments){
        return Arg_comments ? '' : n;
    });
}
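One way the reg placeholder above might be filled in, as a rough sketch: comments go in the first capture group, string literals are matched by the same alternation but handed back unchanged. The name stripComments and the sub-patterns are assumptions made for this example. The (regexp) branch is deliberately left out, so input containing regex literals can still be mangled.

```javascript
// Sketch: a single alternation. Group 1 captures comments; strings are
// matched too, but only so that comment delimiters inside them are skipped.
var reg = /(\/\*[\s\S]*?\*\/|\/\/[^\n\r]*)|"(?:\\[\s\S]|[^"\\\n\r])*"|'(?:\\[\s\S]|[^'\\\n\r])*'/g;

function stripComments(str) {
    return str.replace(reg, function (match, comment) {
        // comment is undefined when a string literal matched instead
        return comment ? '' : match;
    });
}

var out = stripComments('// comment\n/* comment */ program //comment\nvar s = "/* kept */";');
// out: '\n program \nvar s = "/* kept */";'
```

Since the regex engine always takes the leftmost match, whichever construct starts first (string or comment) swallows any delimiters inside it.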

So far there's been 11 Responses to “JavaScript comment removal – revisited”
