Regular Expressions in JavaScript, part 2

A while ago, when I was just getting used to this insanely complicated stuff, I posted a brief introduction to the world of regular expressions. I’m glad to say that, since then, I have learnt a bunch more about them and how you can make use of them within JavaScript. So, here goes:

In JavaScript, there are four string operations that will accept a regular expression as an argument:

String.match() – this method takes a regular expression as its first argument (anything else is converted to one). It’s usually used to extract specific parts of a string or to test whether a string matches a regular expression.
String.replace() – this method accepts either a string or a regular expression as its first argument, and either another string or a function as its second argument. It’s usually used to find and replace certain parts of a string.
String.split() – this method accepts either a string or a regular expression as its first argument; the second argument is used (rarely) to set a limit for the split operation. It’s used to split a string into an array based on the regular expression or string passed as the first parameter.
String.search() – this method accepts a regular expression as its first and only argument. It’s used to find the index of a regex match within a string.
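
To make the list above a bit more concrete, here’s a quick sketch of all four methods run against the same string. The sample sentence and variable names are just made up for illustration:

```javascript
var sentence = 'The quick brown fox';

// match: without the g flag, returns an array describing the first match (or null)
var firstWord = sentence.match(/[a-z]+/i)[0];      // "The"

// replace: swaps the matched part for the replacement string
var swapped = sentence.replace(/quick/, 'slow');   // "The slow brown fox"

// split: cuts the string wherever the pattern matches
var words = sentence.split(/\s+/);                 // ["The", "quick", "brown", "fox"]

// search: returns the index of the first match, or -1 if there is none
var foxIndex = sentence.search(/fox/);             // 16
```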

The RegExp object has its own methods:

RegExp.exec() – this method works much like the String.match() method, except that you pass the string as the argument and the method is run as a member of the regular expression that you’re using to search the string.
RegExp.test() – this method is similar to exec, but instead of returning the match found it returns either true or false depending on whether or not it has found a match.

Correction: Luke pointed out in the comments that String.match and RegExp.exec are slightly different in that the latter will return capture groups plus the first match if a global flag is used, while the former (match) method won’t return any capture groups; only the full matches.
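
Here’s a small sketch of that difference, plus test() for good measure. The pattern and sample string are my own invention; the loop relies on exec() picking up from the regex’s lastIndex on each call when the g flag is set:

```javascript
var re = /(\d+)-(\d+)/g;
var str = '10-20 and 30-40';

// match() with the g flag: every full match, but no capture groups
var allMatches = str.match(re); // ["10-20", "30-40"]

// exec() with the g flag: one match per call, capture groups included
var found = [];
var result;
re.lastIndex = 0; // make sure we start from the beginning of the string
while ((result = re.exec(str)) !== null) {
    found.push([result[1], result[2]]);
}
// found is now [["10", "20"], ["30", "40"]]

// test() only reports whether there is a match at all
var hasRange = /\d+-\d+/.test(str); // true
```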

Because I know no better way to begin, let’s start with a basic example:

Validating user input

One of the most common uses for regular expressions on the client-side is validating user input. Let’s say we need to validate a product ID… We’ve had to leave it up to the user to type it in because there are over 5000 products. All product IDs start with either the letter ‘M’ or ‘D’, followed by 4 or 5 digits and then a single trailing letter to signify upgrades and variations. Validating such an input would be perfectly possible without using a single regular expression, as shown here:

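var usersProductID = 'M5060i';

function isLetter(character) {
    return ('abcdefghijklmnopqrstuvwxyz').indexOf(character.toLowerCase()) > -1;
}

function isValidKey(character) {
    return ('md').indexOf(character.toLowerCase()) > -1;
}

// Checks every character, so strings like '50 6' or '5e3' are rejected
function isAllDigits(str) {
    for (var i = 0; i < str.length; i++) {
        if (('0123456789').indexOf(str.charAt(i)) === -1) {
            return false;
        }
    }
    return true;
}

// Everything between the first and the last character
var digits = usersProductID.slice(1, -1);

var isValidProductID = (
        isValidKey(usersProductID.charAt(0))
        && (digits.length === 4 || digits.length === 5)
        && isAllDigits(digits)
        && isLetter(usersProductID.charAt(usersProductID.length - 1))
    );

alert(isValidProductID); // Boolean, true or false...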

Now, with a regular expression:

var usersProductID = 'M5060i';
 
var isValidProductID = /^[md][0-9]{4,5}[a-z]$/i.test(usersProductID);

Hopefully the above example has demonstrated the necessity and importance of regular expressions in JavaScript (if you weren’t already convinced). Here’s a commented version of our regular expression:

^        - Matches the start of a string
[md]     - Character class that matches 'm' or 'd'
[0-9]    - Character class that matches any digit
{4,5}    - Repeat the preceding item ([0-9]) 4 OR 5 times
[a-z]    - Character class that matches any letter (either case, thanks to the 'i' flag)
$        - Matches the end of a string
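
If you want to convince yourself that each piece is doing its job, here’s a quick sketch that runs the same expression against a handful of IDs. The candidate IDs are invented for the purpose:

```javascript
var productIDPattern = /^[md][0-9]{4,5}[a-z]$/i;

var candidates = ['M5060i', 'd12345X', 'X5060i', 'M506i', 'M506012i', 'M5060'];
var results = {};

for (var i = 0; i < candidates.length; i++) {
    results[candidates[i]] = productIDPattern.test(candidates[i]);
}

// results:
//   M5060i   -> true   (M, 4 digits, letter)
//   d12345X  -> true   (case is ignored, 5 digits)
//   X5060i   -> false  (wrong leading letter)
//   M506i    -> false  (only 3 digits)
//   M506012i -> false  (6 digits)
//   M5060    -> false  (no trailing letter)
```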

In JavaScript there are two ways of defining a regular expression, using its constructor, or literally:

// Constructor:
var myRegexp = new RegExp('^[md][0-9]{4,5}[a-z]$', 'i');
// Literal:
var myRegexp = /^[md][0-9]{4,5}[a-z]$/i;

The only situation in which you’d want to use the constructor would be when you need to add varying data to the regular expression. If it’s constant and does not change then stick with the RegExp literal (/regex goes here/)
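
As a rough sketch of the “varying data” case, here’s a function that counts whole-word occurrences of whatever term it’s given. Both countWord and escapeForRegExp are hypothetical helpers I’ve made up for this example. Two things to watch with the constructor: backslashes have to be doubled inside the string, and any special characters in the incoming data should be escaped first:

```javascript
// Hypothetical helper: escapes characters that have a special meaning in a regex
function escapeForRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: counts whole-word, case-insensitive occurrences of `word`
function countWord(text, word) {
    // '\\b' in the string becomes \b (word boundary) in the regex
    var pattern = new RegExp('\\b' + escapeForRegExp(word) + '\\b', 'gi');
    var matches = text.match(pattern);
    return matches ? matches.length : 0;
}

var theCount = countWord('The cat sat on the mat. THE END.', 'the'); // 3
var dotCount = countWord('a.b a+b a.b', 'a.b');                      // 2
```

Without the escaping step the dot in 'a.b' would match any character, so 'a+b' would be counted too.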

The ‘i’ that you see is a flag. Flags are either passed as the second argument to the constructor or, if you’re using the literal syntax, they’re specified beyond the right-hand delimiter (forward slash) of the expression. The ‘i’ flag in particular means ‘ignore case’, so an ‘a’ in the regular expression will match both ‘a’ and ‘A’ in the string that’s being tested. The available flags include:

i – “ignore case” – the case (uppercase/lowercase) of all letters within the string will be ignored during testing.
g – “global search” – the search is carried out across the entire string, regardless of whether a match has already been found.
m – “multiline search” – the ^ and $ anchors will match at the start and end of each line, not just at the start and end of the whole string.
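
Here’s a small sketch showing what each flag changes in practice. The three-line sample text is invented for the example:

```javascript
var text = 'Apple pie\napple tart\nbanana split';

// No flags: case-sensitive, so the first hit is the lowercase "apple" on line two
var caseSensitiveIndex = text.match(/apple/).index;   // 10

// i: case is ignored, so "Apple" at the very start matches first
var ignoreCaseMatch = text.match(/apple/i)[0];        // "Apple"

// g (with i): every match is returned, not just the first
var everyApple = text.match(/apple/gi);               // ["Apple", "apple"]

// Without m, ^ only matches at the start of the whole string
var firstWordOnly = text.match(/^\w+/g);              // ["Apple"]

// With m, ^ also matches at the start of each line
var firstWordPerLine = text.match(/^\w+/gm);          // ["Apple", "apple", "banana"]
```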

String extraction

I couldn’t come up with a good name; “string extraction” seems suitable, although it sounds a bit dodgy if not in the context of programming. Anyway, back to the point: regular expressions are not only useful in validation; you can extract very precise pieces of information from string data. Let’s say, for example, we have to extract all numbers from a massive string and produce an array from them:

Since the String.match() method returns an array of all the matches this is incredibly easy:

var theString = "Dr. Average has 78 patients, and only 12 of them think he's a good doctor!";
var allNumbers = theString.match(/\d+/g); // \d is just a shortcut to [0-9]

// allNumbers = [ "78", "12" ] (match returns strings)

Notice that we’ve used the g (global) flag; without it we’d only get one match. The only problem with this is that it matches numbers in the middle of other words, like “foo299bar”. Even though this might be a rare occurrence, it’s still important to take it into account. We can eliminate this problem by specifying, in our regular expression, that the digits must have a word boundary on either side, i.e. the position between a word character and a non-word character such as a space (in regular expressions a word character is a letter, a digit or an underscore):

var theString = "Dr. Aver3age has 78 pa555tients, and only 12 of them thi6nk he's a good doctor!";
var allNumbers = theString.match(/\b\d+\b/g); // \b stands for boundary

// allNumbers = [ "78", "12" ]

Even though the string is full of words interspersed with numbers the resulting array still only has ’78’ and ’12’ in it; exactly what we’re after!
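
One thing to keep in mind is that match() hands back strings, and returns null rather than an empty array when nothing is found. Here’s a minimal sketch that wraps the idea into a function and converts the results to real numbers; extractNumbers is a name I’ve made up for the example:

```javascript
// Hypothetical helper: returns every standalone whole number in `text` as a Number
function extractNumbers(text) {
    var matches = text.match(/\b\d+\b/g) || []; // match() returns null when nothing is found
    var numbers = [];
    for (var i = 0; i < matches.length; i++) {
        numbers.push(parseInt(matches[i], 10)); // always pass the radix
    }
    return numbers;
}

var fromDoctor = extractNumbers("Dr. Aver3age has 78 pa555tients, and only 12 of them thi6nk he's a good doctor!");
// fromDoctor is [78, 12]

var fromNothing = extractNumbers('no digits here');
// fromNothing is []
```

It only deals with whole numbers; something like “3.14” would come back as two separate numbers.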

Extracting URLs from a string:

A more realistic application of this technique might be when searching for URLs within a string. URLs can contain a whole variety of characters and checking the validity of a URL is a huge task; our regular expression is going to be a “dumb” one because we haven’t got the time to study the intricacies of the URI specification. Here’s the string we’ve got to work with:

var theString = 'Hey! Please visit http://google.com and http://www.sitepoint.com';

The first thing we want to look for is ‘http://’ so that can be the start of our regular expression:

var urlRegex = /http:\/\//g;

Since forward slashes are used to delimit an expression we need to escape those which are to be taken literally (as actual characters). We can improve this by adding support for some other protocols:

var urlRegex = /(f|ht)tps?:\/\//g;

Now it supports http://, https://, ftp:// and ftps://. I think that’s enough to get us started…

Now, like I said, this is going to be a dumb regular expression and so it won’t be suitable for many situations. A more “intelligent” one would specify all valid characters in order. Next we want to look for a space, i.e. where the URL probably ends:

var urlRegex = /(f|ht)tps?:\/\/.+?\s/g;

I’ve just added .+?\s which translates to: “One or more of any character but as soon as a space is encountered, stop!”

If we try it as it currently stands here’s the result we could get:

theString.match(urlRegex); // [ "http://google.com " ]

So we’ve only matched the first URL. The second URL isn’t being matched because it’s at the end of the string and so there is no space (\s) after it. We can test for this with the dollar symbol ($ will match at the end of a string):

var urlRegex = /(f|ht)tps?:\/\/.+?(\s|$)/g; // matches \s OR $ at the end...

If we try it now, here’s what we get:

theString.match(urlRegex); // [ "http://google.com ", "http://www.sitepoint.com" ]

We can get rid of any trailing spaces using the ‘replace’ method:

var matches = theString.match(urlRegex);
for (var i = 0, len = matches.length; i < len; i++) {
    // replace() returns a new string, so the result has to be assigned back
    matches[i] = matches[i].replace(/^\s+|\s+$/g, '');
}

// The above code removes any whitespace at the
// beginning and/or end of each match. (i.e. trimming)
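
An alternative worth sketching, which is not what we built above but gets to the same place, is to match “one or more non-whitespace characters” with \S+ so that the trailing space is never captured in the first place. This is still a “dumb” expression; it assumes URLs are separated by whitespace and will happily swallow trailing punctuation:

```javascript
var theString = 'Hey! Please visit http://google.com and http://www.sitepoint.com';

// (?:...) is a non-capturing group; \S+ matches up to the next whitespace or the end
var urlRegex = /(?:f|ht)tps?:\/\/\S+/g;

var urls = theString.match(urlRegex) || []; // fall back to [] if nothing matches
// urls is ["http://google.com", "http://www.sitepoint.com"]
```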

Find & replace with style…

The most basic application of the ‘replace’ String operation is passing two strings, one will replace the other:

"I'm tired".replace('tired', 'sleeping'); // "I'm sleeping"

What’s cool about it is that you can also use regular expressions to search for the item you want to replace:

"I wonder how I can remove (these brackets)...".replace(/\((.+?)\)/, '$1');
// Removes the brackets

You can reference groups using ‘$1’, ‘$2’, ‘$3’ etc. in the replacement string. Groups are specified in regular expressions using brackets. The above regular expression is easier to understand when laid out:

\(     - Literal left bracket
(.+?)  - Group containing one or more of any character
            - '?' makes it non-greedy, so it will stop as soon as
              the next character (the right bracket) is encountered...
\)     - Literal right bracket

So, ‘$1’ in the replacement string is referencing all characters found between the brackets. In some circles these replacement keys ($1, $2 etc.) are known as “backreferences”.
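
Groups get more interesting once there’s more than one of them, because you can reorder them in the replacement string. Here’s a short sketch; the name and date strings are just sample data:

```javascript
// Two groups: swap "Surname, Forename" around
var reordered = 'Doe, John'.replace(/(\w+), (\w+)/, '$2 $1');
// reordered is "John Doe"

// Three groups: turn a dd/mm/yyyy date into yyyy-mm-dd
var isoDate = '23/03/2009'.replace(/(\d{2})\/(\d{2})\/(\d{4})/, '$3-$2-$1');
// isoDate is "2009-03-23"
```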

Another incredibly useful thing about the ‘replace’ method is that you can pass a function as its second parameter; this function will be run each time a match is found within the string:

"-moz-border-radius".replace(/-\w/g, function(match){
    return match.replace('-','').toUpperCase();
});
// Returns MozBorderRadius
 
// Made into a function:
 
function camelCaseCSS(property) {
    return property.replace(/-\w/g, function(match){
        // return match.replace('-','').toUpperCase(); // Old way
        return match.charAt(1).toUpperCase(); // New way, suggested by Luke
    });
}

What we’ve got is actually a pretty useful function that converts CSS property names to their DOM counterparts. E.g:

camelCaseCSS('background-color'); // => backgroundColor
camelCaseCSS('font-family');      // => fontFamily
camelCaseCSS('line-height');      // => lineHeight
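
The replacement function is also handed any captured groups as extra arguments after the full match, which saves digging the letter out by hand. Here’s a sketch of that variation alongside a function that goes the other way; camelCaseWithGroup and hyphenateCSS are names I’ve made up for this example:

```javascript
// Same idea as camelCaseCSS, but the letter arrives as the second argument
function camelCaseWithGroup(property) {
    return property.replace(/-(\w)/g, function (match, letter) {
        return letter.toUpperCase();
    });
}

// The reverse: DOM-style name back to a CSS property name
function hyphenateCSS(property) {
    return property.replace(/[A-Z]/g, function (match) {
        return '-' + match.toLowerCase();
    });
}

var camel = camelCaseWithGroup('background-color'); // "backgroundColor"
var hyphenated = hyphenateCSS('backgroundColor');   // "background-color"
var vendor = hyphenateCSS('MozBorderRadius');       // "-moz-border-radius"
```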

Finito!

Thanks for reading!

For more information on regular expressions within JavaScript visit:

Diogok March 23rd, 2009 at 11:25 pm

awesome!

Ben Nadel March 23rd, 2009 at 11:40 pm

Awesome stuff. I am a huge fan of regular expressions in Javascript. I have not given the String.match() method a go yet. I will have to try that.

Baa March 24th, 2009 at 12:08 am

Great post James. Finally somebody encouraged me to really start learning regular expressions. Thanks! 😉

eugene March 24th, 2009 at 2:26 pm

great post
after reading your regex on testing for http, https, ftp...

i dug up an old regex i had for this, and it seems completely bloated compared to yours. great work.

here is my old one (stripped from a class)

// site
Site:function()
{
	var pattern = /^(([f]|[h])(t?tp)(s?)(:\/\/))?((www)(\.))?/;
	var site	= window.content.location.host;
	site		= site.replace(pattern, '');
	return site;
}

Graham B March 24th, 2009 at 3:24 pm

Huh, I never knew replace() could accept a function. Nice one 🙂

Valentino March 24th, 2009 at 9:33 pm

I suggest this online tool for testing and creating RegEx:

http://gskinner.com/RegExr/

When I started using it, creating regular expressions became actually fun!

Luke March 26th, 2009 at 6:17 am

Great post. Regex is definitely something that more js developers need to learn. One correction and a couple suggestions for you, though…

Correction: str.match and regex.exec are only the same if the regex is not compiled with the g flag. Against a g regex, match does not return capture groups, but every instance of the full match. exec returns the first match and any capture groups and updates the regex’s lastIndex Property. Future calls to exec will start from that index (useful in a while loop).

var re = /f(\w)\1/g, str = "Calling all foo, calling all faa";
str.match(re); // ["foo","faa"]
str.match(re); // ["foo","faa"]
re.exec(str);  // ["foo","o"] and re.lastIndex === 15
re.exec(str);  // ["faa","a"] and re.lastIndex === 32

Suggestions:
Unless you specifically need the data in your groups, use a non-capturing group, vis /(?:f|ht)tps?/ won’t internally store or return the f or ht string. It makes for marginally faster code, but more importantly, you don’t have to specify unused variables for dead captures if you have important capture groups afterward.

And your camelCaseCSS function doesn’t need an inner regex. charAt(1) will do the work much faster.

function camelCaseCSS(property) {
    return property.replace(/-\w/g, function(match){
        return match.charAt(1).toUpperCase();
    });
}

Another great resource for advanced regex know-how is Steven Levithan’s blog Flagrant Badassery http://blog.stevenlevithan.com/

James March 26th, 2009 at 9:26 am

Thanks for the comments!

@Valentino, that looks like a very useful tool, thank you!

@Luke, thank you for the suggestions! I had no idea exec and match behaved differently. I’ll add that to the post. Also, just subscribed to http://blog.stevenlevithan.com/ – looks great!

TJ Loh September 2nd, 2009 at 5:16 pm

Great Post! Thanks