A while ago I posted a method I had been using at the time to remove comments from JavaScript code. It was pretty decent – instead of using a regular expression it steps through each character and removes comments where it finds them.
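For contrast, stepping through the source one character at a time looks roughly like the sketch below. This is not the code from that earlier post: the function name `stripCommentsByScanning` is made up for illustration, and the sketch deliberately ignores regular-expression literals, so treat it as a rough outline only.

```javascript
// Hypothetical sketch of a character-by-character comment stripper.
// Assumption: only string literals and comments are recognised;
// regex literals are NOT handled here.
function stripCommentsByScanning(source) {
  var out = '';
  var i = 0;
  var quote = null; // current string delimiter, or null when outside a string

  while (i < source.length) {
    var ch = source[i];
    var next = source[i + 1];

    if (quote) {
      out += ch;
      if (ch === '\\' && i + 1 < source.length) {
        // Copy the escaped character so an escaped quote does not end the string.
        out += next;
        i += 2;
        continue;
      }
      if (ch === quote) {
        quote = null;
      }
      i++;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
      out += ch;
      i++;
    } else if (ch === '/' && next === '/') {
      // Single-line comment: skip to the end of the line, keeping the line break.
      while (i < source.length && source[i] !== '\n' && source[i] !== '\r') {
        i++;
      }
    } else if (ch === '/' && next === '*') {
      // Multi-line comment: skip past the closing delimiter (or to the end of input).
      var end = source.indexOf('*/', i + 2);
      i = end === -1 ? source.length : end + 2;
    } else {
      out += ch;
      i++;
    }
  }

  return out;
}
```

The loop only ever needs to know whether it is currently inside a string, which is what makes this style easy to reason about compared with a chain of regular expressions.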

At the time I thought stepping through a string character-by-character was the only reliable way to solve the “comments problem”, but after giving it another attempt I found that it was possible with only a few regular expressions and a moderate dose of JavaScript’s replace() function.

UPDATE: Be cautious about using this: it’s come to my attention that there are still some issues with it. Maybe, for now, trust the parsers over this 🙂

Here it is:

function removeComments(str) {
 
    var uid = '_' + +new Date(),
        primatives = [],
        primIndex = 0;
 
    return (
        str
        /* Remove strings */
        .replace(/(['"])(\\\1|.)+?\1/g, function(match){
            primatives[primIndex] = match;
            return (uid + '') + primIndex++;
        })
 
        /* Remove Regexes */
        .replace(/([^\/])(\/(?!\*|\/)(\\\/|.)+?\/[gim]{0,3})/g, function(match, $1, $2){
            primatives[primIndex] = $2;
            return $1 + (uid + '') + primIndex++;
        })
 
        /*
        - Remove single-line comments that contain would-be multi-line delimiters
            E.g. // Comment /* <--
        - Remove multi-line comments that contain would be single-line delimiters
            E.g. /* // <-- 
       */
        .replace(/\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g, '')
 
        /*
        Remove single and multi-line comments,
        no consideration of inner-contents
       */
        .replace(/\/\/.+?(?=\n|\r|$)|\/\*[\s\S]+?\*\//g, '')
 
        /*
        Remove multi-line comments that have a replaced ending (string/regex)
        Greedy, so no inner strings/regexes will stop it.
       */
        .replace(RegExp('\\/\\*[\\s\\S]+' + uid + '\\d+', 'g'), '')
 
        /* Bring back strings & regexes */
        .replace(RegExp(uid + '(\\d+)', 'g'), function(match, n){
            return primatives[n];
        })
    );
 
}

Theoretically this should work perfectly in almost all situations. Don’t bother even trying it with E4X as that definitely won’t work! E.g.

var someE4X = <box>// this is NOT a comment</box>;

It’s impossible to cater to E4X with regular expressions because XML is a recursive structure. I’m not bothered though as E4X isn’t exactly a widely used extension. It also doesn’t play well with conditional compilation but frankly, conditional compilation shouldn’t exist anyway.

Anyway, back to the solution. It takes a pretty conventional approach of removing all strings and regular expressions first and then moving on to the comments. Unfortunately comments are not as simple as /\*.+?\*/: comment delimiters can turn up inside strings, inside literal regular expressions and inside other comments.
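To make the "hide the literals first" idea concrete, here is a cut-down sketch of the same three steps: stash, strip, restore. It is not the function above. The name `stripCommentsSketch`, the placeholder format and the simplified regular expressions are my own assumptions for illustration, and it only protects string literals, not regular expressions.

```javascript
// Hypothetical, simplified sketch of the placeholder technique.
// Assumption: only string literals are protected; regex literals are ignored.
function stripCommentsSketch(source) {
  var token = '_' + +new Date() + '_'; // unlikely to occur in real code
  var stash = [];

  // 1. Swap every string literal for a numbered placeholder.
  var masked = source.replace(/(['"])(?:\\[\s\S]|(?!\1)[^\\\n])*\1/g, function (literal) {
    stash.push(literal);
    return token + (stash.length - 1) + '_';
  });

  // 2. With the strings hidden, anything that looks like a comment is one.
  var withoutComments = masked.replace(/\/\*[\s\S]*?\*\/|\/\/[^\n\r]*/g, '');

  // 3. Put the original string literals back.
  return withoutComments.replace(new RegExp(token + '(\\d+)_', 'g'), function (match, index) {
    return stash[Number(index)];
  });
}

var input = [
  'var s = "// not a comment"; // real comment',
  "/* block */ var t = '/* nope */';"
].join('\n');

console.log(stripCommentsSketch(input));
```

Both string literals survive untouched while the two real comments disappear. The trailing underscore on each placeholder is a small design choice: it stops a placeholder from being confused with one that happens to be followed by a digit.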

The only part of my solution which might need further explanation is the following expression:

/\/\/.*?\/?\*.+?(?=\n|\r)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g

This regex searches for single-line or multi-line comments that have the starting delimiter of the other type of comment within. For example:

/*
    This is a multi-line comment
    // Still a multi-line comment
*/
 
// This is a single-line comment /* ...still a single-line comment

These two situations need to be catered to so that the subsequently executed regular expression doesn’t remove the wrong parts of such comments.
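Here is a small sketch that runs that expression, written out with its escapes, over the two example comments in isolation. The variable names and the two `var` statements are mine, added only so that something is left over after the replacement.

```javascript
// The cross-delimiter expression on its own.
var crossDelimiters = /\/\/.*?\/?\*.+?(?=\n|\r)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g;

// Illustrative input: one comment of each kind, each containing the
// opening delimiter of the other kind.
var sample = [
  '/*',
  '    This is a multi-line comment',
  '    // Still a multi-line comment',
  '*/',
  'var a = 1; // This is a single-line comment /* ...still a single-line comment',
  'var b = 2;'
].join('\n');

var stripped = sample.replace(crossDelimiters, '');

console.log(JSON.stringify(stripped));
// → "\nvar a = 1; \nvar b = 2;"
```

Each comment is removed as a whole, so the stray `//` and `/*` inside them never reach the simpler expression that runs afterwards.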

I’ve tested it with the following JavaScript and it works perfectly:

'string\' // still a string'; // comment /* not-a-nested-comment
/regex/; // comment */* still-a-comment
' /**/ string ' /* "comment..."
// still-a-comment */ alert('This isn\'t a comment!');
/\/* this isn\'t a comment! */; //* comment
/*
    //a comment... // still-a-comment
    12345
    "Foo /bar/ \""
*/
/*//Boo*/
/*/**/

Try the demo!
