A while ago I posted a method I had been using at the time to remove comments from JavaScript code. It was pretty decent – instead of using a regular expression it steps through each character and removes comments where it finds them.
At the time I thought stepping through a string character-by-character was the only reliable way to solve the “comments problem”, but after giving it another attempt I found that it was possible with only a few regular expressions and a fairly moderate dose of JavaScript’s replace() function.
UPDATE: Be cautious about using this: it’s come to my attention that there are still some issues with it. Maybe, for now, trust the parsers over this.
Here it is:
function removeComments(str) {

    var uid = '_' + +new Date(),
        primatives = [],
        primIndex = 0;

    return (
        str
        /* Remove strings */
        .replace(/(['"])(\\\1|.)+?\1/g, function(match){
            primatives[primIndex] = match;
            return (uid + '') + primIndex++;
        })

        /* Remove Regexes */
        .replace(/([^\/])(\/(?!\*|\/)(\\\/|.)+?\/[gim]{0,3})/g, function(match, $1, $2){
            primatives[primIndex] = $2;
            return $1 + (uid + '') + primIndex++;
        })

        /*
        - Remove single-line comments that contain would-be multi-line delimiters
            E.g. // Comment /* <--
        - Remove multi-line comments that contain would be single-line delimiters
            E.g. /* // <--
        */
        .replace(/\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g, '')

        /*
        Remove single and multi-line comments,
        no consideration of inner-contents
        */
        .replace(/\/\/.+?(?=\n|\r|$)|\/\*[\s\S]+?\*\//g, '')

        /*
        Remove multi-line comments that have a replaced ending (string/regex)
        Greedy, so no inner strings/regexes will stop it.
        */
        .replace(RegExp('\\/\\*[\\s\\S]+' + uid + '\\d+', 'g'), '')

        /* Bring back strings & regexes */
        .replace(RegExp(uid + '(\\d+)', 'g'), function(match, n){
            return primatives[n];
        })
    );

}
Theoretically this should work perfectly in almost all situations. Don’t bother even trying it with E4X as that definitely won’t work! E.g.
var someE4X = <box>// this is NOT a comment</box>;
It’s impossible to cater to E4X with regular expressions because XML is a recursive structure. I’m not bothered though as E4X isn’t exactly a widely used extension. It also doesn’t play well with conditional compilation but frankly, conditional compilation shouldn’t exist anyway.
Anyway, back to the solution. It takes a pretty conventional approach of removing all strings and regular expressions first and then moving on to the comments. Unfortunately comments are not as simple as /\*.+?\*/ because comment delimiters can show up inside strings, inside literal regular expressions and inside other comments.
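To make the "hide the strings first" idea concrete, here is a stripped-down sketch of it. It is not the function from this post: the name stripCommentsKeepingStrings, the token format and the string pattern are all made up for illustration, and it ignores regex literals and quotes that appear inside comments.

```javascript
// Minimal sketch of the placeholder technique (hypothetical helper, not the
// post's function): hide string literals, strip comments, restore the strings.
function stripCommentsKeepingStrings(source) {
    var token = '_' + (+new Date()) + '_'; // assumed not to occur in the source
    var saved = [];

    return source
        // 1. Swap each quoted string for a numbered placeholder.
        .replace(/(['"])(?:\\[\s\S]|(?!\1)[^\\\r\n])*\1/g, function (match) {
            saved.push(match);
            return token + (saved.length - 1) + '_';
        })
        // 2. With strings hidden, any remaining // or /* starts a comment.
        .replace(/\/\*[\s\S]*?\*\/|\/\/[^\r\n]*/g, '')
        // 3. Put the saved strings back.
        .replace(new RegExp(token + '(\\d+)_', 'g'), function (match, n) {
            return saved[n];
        });
}

var demo = 'var url = "http://example.com"; // trailing note\n' +
           '/* block */ var x = \'a/*b*/c\';';

// Both strings survive untouched; the two real comments are removed.
console.log(stripCommentsKeepingStrings(demo));
```

Anything that looks like a quote inside a comment (an apostrophe, say) will still confuse step 1, which is part of why the real function needs extra passes.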
The only part of my solution which might need further explanation is the following expression:
/\/\/.*?\/?\*.+?(?=\n|\r)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g
This regex searches for single-line or multi-line comments that have the starting delimiter of the other type of comment within. For example:
/* This is a multi-line comment
   // Still a multi-line comment */

// This is a single-line comment /* ...still a single-line comment
These two situations need to be catered to so that the subsequently executed regular expression doesn’t remove the wrong parts of such comments.
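As a rough check of what that expression picks up, the snippet below runs it on its own against both situations. The sample input and variable names are invented for the demo, and the pattern is the version with $ in the lookahead, as used in the full function.

```javascript
// The mixed-delimiter pattern, run in isolation on a made-up sample.
var mixed = /\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g;

var sample = [
    'a = 1; /* multi-line comment',
    '// still the same comment */',
    'b = 2; // single-line comment /* still single-line',
    'c = 3;'
].join('\n');

var cleaned = sample.replace(mixed, '');

// Each comment is removed as a whole, inner delimiter included:
// 'a = 1; \nb = 2; \nc = 3;'
console.log(JSON.stringify(cleaned));
```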
I’ve tested it with the following JavaScript and it works perfectly:
'string' // still a string'; // comment /* not-a-nested-comment
/regex/; // comment */* still-a-comment
' /**/ string '
/* "comment..." // still-a-comment */
alert('This isn\'t a comment!');
//* this isn't a comment! */;
//* comment
/* //a comment... // still-a-comment 12345 "Foo /bar/ "" */
/*//Boo*/
/*/**/
Hmm.. it seems to work pretty well.
These kinds of regex replacement things are rarely foolproof though of course. You did a nice job though.
But I threw this at it and it failed, haha. Notepad++ got it right though.
Hi James…
Nice script. Lord knows I couldn’t have come up with anything in that league.
It seems that if the last line is a single line comment and the line doesn’t have a break at the end, the script doesn’t recognize the line as a comment.
Doesn’t recognize the last line…
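The last-line case is easy to reproduce in isolation. This sketch compares a single-line-comment pattern whose lookahead only accepts a line break with one that also accepts the end of the input; the variable names and sample string are made up for the demo.

```javascript
// Lookahead accepts only a line break after the comment.
var withoutEnd = /\/\/.+?(?=\n|\r)/g;
// Lookahead also accepts the end of the input ($).
var withEnd = /\/\/.+?(?=\n|\r|$)/g;

var code = 'var a = 1;\n// final comment with no line break';

var missed = code.replace(withoutEnd, ''); // comment is left in place
var caught = code.replace(withEnd, '');    // comment is removed

console.log(missed === code);        // true
console.log(JSON.stringify(caught)); // "var a = 1;\n"
```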
@Corey, Ah.. thanks. Fixed now!
@Nicolaj, fixed too (forgot to add $ to regex).
Hi,
I’ve noticed that you often use this expression: 'some_string' + +new Date().
It’s good practice to add brackets, 'some_string' + (+new Date()), to avoid errors after minifying code.
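For anyone wondering what the worry is, here is a small sketch. It uses a fixed date so the result is predictable, and the "squashed" string is just a guess at what a careless whitespace stripper might emit, not the output of any particular minifier.

```javascript
var stamp = new Date(0); // fixed timestamp, so +stamp is 0

var spaced = '_' + +stamp;      // binary plus, then unary plus
var bracketed = '_' + (+stamp); // same thing, made explicit

console.log(spaced, bracketed); // _0 _0

// If the space between the two plus signs were dropped, "++" would be read
// as an increment operator and the expression would no longer parse.
var squashedFails = false;
try {
    new Function("return '_'++new Date(0);");
} catch (e) {
    squashedFails = e instanceof SyntaxError;
}

console.log(squashedFails); // true
```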
Just been testing it a little more and have found some more bugs. I guess it’s true what they say; the comment problem can’t be solved with regular expressions.
@Piotr, any minifying program worth its salt won’t screw up that line.
// comment
/* comment */ program //comment
Still not perfect, sorry.
This source doesn’t print “program” as expected, but “/comment”.
But nice try!
try this one to capture both SingleLine and MultiLine Comments:
/(\/\*[\u0000-\uFFFF]*?(?=\*\/)\*\/|\/\/[^\u000A|\u000D|\u2028|\u2029]*)/
Works for all of the examples in the comments
I am sure that [\u0000-\uFFFF] is probably too broad but I could not find a good example of what a unicode “code point” was according to ecmascript spec.
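A quick way to try that pattern out is sketched below. The g flag is added here as an assumption so that replace() removes every match, and the inputs are made up. It handles the plain two-line example from earlier in the thread, but it knows nothing about strings, and the | characters inside the character class are treated as literal pipes.

```javascript
// The pattern from the comment above, with a g flag added for the demo.
var commentPattern = /(\/\*[\u0000-\uFFFF]*?(?=\*\/)\*\/|\/\/[^\u000A|\u000D|\u2028|\u2029]*)/g;

// Plain comments are removed and the code between them is kept.
var plain = '// comment\n/* comment */ program //comment'.replace(commentPattern, '');
console.log(JSON.stringify(plain)); // "\n program "

// A // inside a string is treated as the start of a comment.
var mangled = 'var u = "http://x";'.replace(commentPattern, '');
console.log(JSON.stringify(mangled)); // "var u = \"http:"

// A pipe inside a single-line comment stops the match early.
var piped = '// a | b'.replace(commentPattern, '');
console.log(JSON.stringify(piped)); // "| b"
```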
I searched in many places but none of those codes worked for me. I wanted to remove starting multiline comments, and your regex worked without problem! (I will give you creds inside my code). Thank you!
Incidentally, you’ve misspelled “primitives”.
Hello there.
Thanks for the post. You saved me a lot of time – I needed something like this to parse the comments out of a CSS style sheet. Excellent research!
“the comment problem can’t be solved with regular expressions.”
Maybe it can be solved with regular expressions:
function removeComments(str){
    // `reg` is a placeholder for a pattern of the form
    //   /(comments)|(string)|(regexp)/g
    // When Arg_comments is not an empty string, a comment was found, so it is
    // replaced with "". If not, the match itself is returned (a string like
    // "'/***/'" or "//.." or a regex like /./**/ ...).
    return str.replace(reg, function(n, Arg_comments){
        return Arg_comments ? '' : n;
    });
}
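Here is one way that idea might look once the placeholder pattern is filled in. This is only a sketch: the names TOKEN and removeCommentsSinglePass are made up, and the regex-literal alternative is left out because telling a regex literal from a division needs more context than a single pattern has.

```javascript
// Group 1 captures comments. The string alternatives are matched only so
// that they are consumed whole and handed back untouched.
var TOKEN = /(\/\*[\s\S]*?\*\/|\/\/[^\r\n]*)|"(?:\\[\s\S]|[^"\\\r\n])*"|'(?:\\[\s\S]|[^'\\\r\n])*'/g;

function removeCommentsSinglePass(source) {
    return source.replace(TOKEN, function (match, comment) {
        // `comment` is undefined when a string alternative matched.
        return comment ? '' : match;
    });
}

var mixedInput = 'var s = "// not a comment"; // real comment\n' +
                 "var t = '/* nor this */'; /* gone */";

// The two strings are kept as they are; the two real comments are removed.
console.log(removeCommentsSinglePass(mixedInput));
```

Because strings and comments are matched by the same left-to-right scan, whichever starts first wins, so a quote inside a comment or a // inside a string no longer needs a separate pass.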