For the recently developed debug.js, I had to come up with a way to remove all comments from any piece of JavaScript code.
I originally thought that this would be a piece of cake; a simple regex takes care of everything!
    code.replace(/\/\*.+?\*\/|\/\/.*(?=[\n\r])/g, '');
This regular expression would have worked in 90% of situations but, unfortunately, I had to build something that would work in every single situation.
It’s worth mentioning exactly when the above regular expression would fail:
- When comment notation exists in a string, e.g.

    var str = " /* not a real comment */ ";

- When comment notation exists in a literal regular expression, e.g.

    var regex = /\/*.*/;

- When conditional compilation (supported in IE > 4) exists in the code, e.g.

    /*@cc_on @*/
    /*@if (@_jscript_version == 4)
    alert("JavaScript version 4");
    @else @*/
    alert("Blah blah blah");
    /*@end @*/
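To make the first two failure cases concrete, here is a minimal sketch that runs the naive pattern against them. The `naiveStrip` name and the sample inputs are mine, chosen purely for illustration.

```javascript
// Hypothetical helper wrapping the naive one-line approach.
function naiveStrip(code) {
  return code.replace(/\/\*.+?\*\/|\/\/.*(?=[\n\r])/g, '');
}

// Case 1: comment notation inside a string literal.
const withString = 'var str = " /* not a real comment */ ";\n';
// Case 2: comment notation inside a regex literal (source text: /\/*.*/).
const withRegex = 'var regex = /\\/*.*/;\n';

const brokenString = naiveStrip(withString);
const brokenRegex = naiveStrip(withRegex);

console.log(JSON.stringify(brokenString)); // the string's contents are gone
console.log(JSON.stringify(brokenRegex));  // the regex literal is cut short
```

In both cases the output is no longer the program that went in, which is exactly the problem a comment stripper must avoid.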
While the likelihood of any of the above happening is low, it’s certainly worth catering to all potential situations, just in case one of them arises!
So, after a bit of googling and messing around, it turns out that the only way of doing this properly is to loop through the code, character by character, checking for certain delimiters and then enabling/disabling modes as the loop progresses:
    /*
        This function is loosely based on the one found here:
        http://www.weanswer.it/blog/optimize-css-javascript-remove-comments-php/
    */
    function removeComments(str) {
        // Pad both ends so look-behind/look-ahead indexes always hit something
        str = ('__' + str + '__').split('');
        var mode = {
            singleQuote: false,
            doubleQuote: false,
            regex: false,
            blockComment: false,
            lineComment: false,
            condComp: false
        };
        for (var i = 0, l = str.length; i < l; i++) {

            if (mode.regex) {
                if (str[i] === '/' && str[i-1] !== '\\') {
                    mode.regex = false;
                }
                continue;
            }

            if (mode.singleQuote) {
                if (str[i] === "'" && str[i-1] !== '\\') {
                    mode.singleQuote = false;
                }
                continue;
            }

            if (mode.doubleQuote) {
                if (str[i] === '"' && str[i-1] !== '\\') {
                    mode.doubleQuote = false;
                }
                continue;
            }

            if (mode.blockComment) {
                if (str[i] === '*' && str[i+1] === '/') {
                    str[i+1] = '';
                    mode.blockComment = false;
                }
                str[i] = '';
                continue;
            }

            if (mode.lineComment) {
                if (str[i+1] === '\n' || str[i+1] === '\r') {
                    mode.lineComment = false;
                }
                str[i] = '';
                continue;
            }

            if (mode.condComp) {
                if (str[i-2] === '@' && str[i-1] === '*' && str[i] === '/') {
                    mode.condComp = false;
                }
                continue;
            }

            mode.doubleQuote = str[i] === '"';
            mode.singleQuote = str[i] === "'";

            if (str[i] === '/') {

                if (str[i+1] === '*' && str[i+2] === '@') {
                    mode.condComp = true;
                    continue;
                }
                if (str[i+1] === '*') {
                    str[i] = '';
                    mode.blockComment = true;
                    continue;
                }
                if (str[i+1] === '/') {
                    str[i] = '';
                    mode.lineComment = true;
                    continue;
                }
                mode.regex = true;

            }

        }
        // Drop the two padding characters from each end
        return str.join('').slice(2, -2);
    }
The best way to wrap your head around the above code is to take it step by step. There are six modes; only one mode will be set to true at any time during iteration. The active mode represents the construct that is currently being looped through (a string, a regular expression, a comment, etc.). The modes include:
- mode.singleQuote: Single-quote delimited string ('string').
- mode.doubleQuote: Double-quote delimited string ("string").
- mode.regex: Literal regular expression (/regex/).
- mode.blockComment: Block comment (/*...*/).
- mode.lineComment: Line comment (//...).
- mode.condComp: Conditional compilation (/*@...@*/).
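Because only one mode is ever active, the six flags behave like a single state variable. The sketch below illustrates that idea and is not the function above: `labelStates` and the state names are hypothetical, and it tracks only double-quoted strings and block comments to stay short.

```javascript
// Hypothetical illustration: one `state` variable instead of several booleans.
// Returns one label per character of `source`.
function labelStates(source) {
  let state = 'code';
  const labels = [];
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    const next = source[i + 1];
    if (state === 'doubleQuote') {
      labels.push(state);
      // Same previous-character escape check as the original function.
      if (ch === '"' && source[i - 1] !== '\\') state = 'code';
    } else if (state === 'blockComment') {
      labels.push(state);
      if (ch === '*' && next === '/') {
        labels.push(state); // the closing '/' still belongs to the comment
        i++;
        state = 'code';
      }
    } else if (ch === '"') {
      state = 'doubleQuote';
      labels.push(state);
    } else if (ch === '/' && next === '*') {
      state = 'blockComment';
      labels.push(state, state); // label both '/' and '*'
      i++;
    } else {
      labels.push('code');
    }
  }
  return labels;
}

// A string containing comment notation, followed by a real comment.
const sample = '"/*x*/" /*y*/';
const labels = labelStates(sample);

console.log(labels[1]); // "doubleQuote": the "/*" inside the string is not a comment
console.log(labels[7]); // "code": the space between the string and the comment
console.log(labels[8]); // "blockComment"
```

A single variable makes the mutual exclusivity explicit, at the cost of one extra string comparison per branch; whether that matters for speed is something I have not measured.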
Here’s an example trace through the loop:

    Using string -> "a\"" /*Boo!*/

    01. Double quote; *mode.doubleQuote* activated.
    02. Letter 'a'; loop continues.
    03. Character '\'; loop continues.
    04. Double quote; ignored because the previous character is an escaper.
    05. Double quote; last character is not '\'; so *mode.doubleQuote* de-activated
    06. Space; loop continues.
    07. Character '/'; Next character is asterisk; *mode.blockComment* activated
        - character replaced with an empty string
    08. Letter 'B'; loop continues.
        - character replaced with an empty string
    09. Letter 'o'; loop continues.
        - character replaced with an empty string
    10. Letter 'o'; loop continues.
        - character replaced with an empty string
    11. Character '!'; loop continues.
        - character replaced with an empty string
    12. Character '*' followed by '/'; *mode.blockComment* de-activated
        - both characters replaced with an empty string

    Result -> "a\""
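Step 04 relies on the previous character being a backslash. That check cannot tell an escaping backslash from an escaped one, so a string that ends in two backslashes would be read as still open. One possible refinement, which is not part of the function above, is to count the run of backslashes before the quote; `isEscaped` is a hypothetical helper name.

```javascript
// Hypothetical helper: a character counts as escaped only when it is
// preceded by an odd number of consecutive backslashes.
function isEscaped(chars, index) {
  let backslashes = 0;
  for (let i = index - 1; i >= 0 && chars[i] === '\\'; i--) {
    backslashes++;
  }
  return backslashes % 2 === 1;
}

const escapedQuote = 'a\\"';       // source text: a\"
const literalBackslash = 'a\\\\"'; // source text: a\\"

console.log(isEscaped(escapedQuote, 2));     // true: the quote is escaped
console.log(isEscaped(literalBackslash, 3)); // false: the backslash is escaped, not the quote
```

Swapping this in for the `str[i-1] !== '\\'` comparisons would add a short inner loop only when a quote or slash is found, so the cost should be small, though I have not benchmarked it.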
There’s quite a lot of forward- and back-tracking involved, which is why a couple of arbitrary characters are added to either end of the string before the loop: to make sure something is there when str[i-2] is queried.
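The padding and the "replace with an empty string" steps can be seen in isolation in the sketch below. It is a cut-down illustration rather than the full function: `stripBlockComments` is a hypothetical name, and it deliberately ignores strings, regular expressions, and line comments.

```javascript
// Hypothetical cut-down example: pad, blank out block comments, unpad.
function stripBlockComments(source) {
  // Two sentinel characters on each side keep chars[i - 2] and chars[i + 2]
  // inside the array for every index that belongs to the original source.
  const chars = ('__' + source + '__').split('');
  let inComment = false;
  for (let i = 2; i < chars.length - 2; i++) {
    if (inComment) {
      if (chars[i] === '*' && chars[i + 1] === '/') {
        chars[i + 1] = '';
        inComment = false;
        chars[i] = '';
        i++; // the closing '/' has already been handled
      } else {
        chars[i] = '';
      }
    } else if (chars[i] === '/' && chars[i + 1] === '*') {
      chars[i] = '';
      chars[i + 1] = '';
      i++; // skip the '*' so "/*/" is not read as a complete comment
      inComment = true;
    }
  }
  // Blanked entries vanish on join; slice removes the sentinels again.
  return chars.join('').slice(2, -2);
}

console.log(stripBlockComments('a /*Boo!*/ b')); // "a  b"
```

Blanking array entries instead of splicing them out keeps every index stable during the loop, which is what makes the look-ahead and look-behind comparisons safe.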
Note: the code I used in the removeComments function could be shortened; in fact, the entire function could probably be squeezed into 20 lines, but that would only slow it down. Terseness does not always equal speed, especially in this situation; a somewhat repetitive stream of if statements really is the only way to produce acceptable performance.
I’d love to be proven wrong in this situation so if anyone can come up with an easier way of doing this I’d love to hear it! Especially if you think you can solve this with regular expressions alone!
Thanks for reading! Please share your thoughts with me on Twitter. Have a great day!
I think you’ll come across similar problems when writing any kind of parser (and your script is actually a parser for JavaScript comments).
I’ve stumbled across it several times in the past, the last two while writing a syntax highlighter and the latest while writing a small parser for google-style search queries. In the first case (the syntax highlighter), I decided it’s not worth the extra resources for such edge cases, since it was mainly for my personal use and I wanted something fast, even if I had to sacrifice 100% correctness.
In the second, more recent case, I decided to take a similar approach as you did, since such cases were going to be really common and it’s also a commercial project, so I can’t have search queries failing due to lazy parsing.
I’m also really interested if there’s a better solution. I think perhaps it would be somehow possible by combining regex lookahead and lookbehind (in languages that support them; JS doesn’t support lookbehind) but I’m not very experienced with those two.
This looks very handy indeed. Very useful for CSS and Javascript optimisers among other things.
I may be wrong, but it looks like it could still be tripped up with double backslashes… E.g. "a\\"" /*Boo!*/
Actually, one of the first things you’ll learn in a compilers class is that no regular expression can take care of the comments problem; you need a full-blown parser to do that. So I don’t think you can do it any simpler way than going through every character. You can, however, if you’re willing to do it in C/C++ or Java, use something like lex to do it with only a few lines of code.
Thanks for the comments!
@Lea, I thought about whether or not it’d be achievable using regex lookaheads/behinds but I figured, even if JavaScript supported lookbehinds, it would be quite slow.
@Rick, you’re right; it would be tripped up by that, but it’s not valid JavaScript so I’m not bothered. "\\"" would throw a syntax error (since the first \ is escaping the second \, there is nothing left to escape the second ").
@Vasco, it seems so; I read more about this technique over here: http://www.codeproject.com/KB/cs/jscompress.aspx; they seem to be using a similar method to remove comments. It’s funny how something that seems so simple can end up being quite complicated…
Great, thanks for that piece of code; it helped me very much!