Regular Expressions

Last, but definitely not least, we come to the topic of Regular Expressions (RE). A regular expression is a not-so-simple pattern of characters that can be used to match a sequence of characters in a string. Combined with the right method, regular expressions can perform some pretty heavyweight text search and replace duties. This is not limited to checking that someone has typed in the right kind of phone number in a form element or validating some other kind of data. We could, for instance, search through a bank's records and add new digits to everyone's bank account number, or discover the number of times Homer says "Doh!" in the screenplay for an episode of The Simpsons.

JavaScript has had built-in support for regular expressions since version 1.2, in the shape of the RegExp object and a few methods attached to the String object, all of which we'll come to later. Before we look at these, we need to nail down exactly how you construct such an expression.

Rolling Your Own RE

When it comes to producing your own regular expressions, you have two options in JavaScript: you can write them either as literals or as objects. For example:

var myRE = new RegExp("R2D2");
var myRE = /R2D2/;

Both these lines do the same thing: they assign to myRE a reference to a newly created RegExp object whose expression will match an instance of the sequence "R2D2" in a string. The RegExp object contains the data for the RE you specify when that object is created. Easy. However, you can't really make use of the full power of regular expressions until you learn their alphabet and syntax.
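As a quick check that the two forms behave alike, here is a minimal sketch. It borrows the RegExp test() method ahead of its proper introduction, and the variable names and sample sentence are made up for illustration.

```javascript
// Two ways to build the same pattern (names are illustrative only).
var fromConstructor = new RegExp("R2D2");
var fromLiteral = /R2D2/;

var sentence = "C3PO and R2D2 crossed the desert.";

// test() returns true when the pattern occurs anywhere in the string.
var constructorFound = fromConstructor.test(sentence);
var literalFound = fromLiteral.test(sentence);

// The source property holds the pattern text of each object.
var samePattern = fromConstructor.source === fromLiteral.source;

console.log(constructorFound, literalFound, samePattern); // true true true
```

The literal form is usually the more convenient of the two when the pattern is known in advance; the constructor form is handy when the pattern has to be assembled from a string at run time.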

The first point is easily learned: every RE literal is contained within a pair of forward slashes, and the only things that may follow the closing slash are two optional switches.

var blankRE = /  /;

The two switches mentioned earlier are g and i, and affect directly how the search to match your regular expression is conducted. g, the global switch, tells the search to find every instance of your character sequence in the target string, rather than just to find the first and then stop looking. i, meanwhile, tells the search mechanism that the search is case insensitive. For instance:

var myRE = /R2D2/;       // finds first instance of R2D2
var myRE = /R2D2/i;      // finds first instance of r2d2, R2d2, r2D2 or R2D2
var myRE = /R2D2/g;      // finds all instances of R2D2
var myRE = /R2D2/gi;     // finds all instances of r2d2, R2d2, r2D2, or R2D2
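To see the switches at work, the sketch below runs the four variations against a made-up sentence using the String match() method, which is covered properly later on. The sample string and variable names are assumptions for illustration.

```javascript
var droids = "r2d2 whistled, R2D2 beeped, and R2d2 rolled away.";

// No switches: the first exact-case match only.
var exact = droids.match(/R2D2/);

// i: the first match in any mix of upper and lower case.
var anyCase = droids.match(/R2D2/i);

// g: every exact-case match, returned as an array of strings.
var allExact = droids.match(/R2D2/g);

// gi: every match, regardless of case.
var allAnyCase = droids.match(/R2D2/gi);

console.log(exact[0]);          // "R2D2"
console.log(anyCase[0]);        // "r2d2"
console.log(allExact.length);   // 1
console.log(allAnyCase.length); // 3
```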

With that out of the way, we'd better look and see what we can put inside the slashes.

The RE Alphabet

The alphabet for regular expressions incorporates all the alphanumeric characters, upper and lower case, and quite a few other special characters in the form of escape sequences, as shown below. Note that an escape sequence may match one or more ordinary characters or alternatively a special condition that isn't an ordinary character, like the start of a string, as we'll see in the pages to come.

Character to Match                                Corresponding Escape Sequence
Any alphanumeric character (a-z, A-Z, 0-9)        Itself
Any of . ? / \ [ ] { } ( ) + * | ^ $              \ followed by the character. For example, \{ matches {
Form feed                                         \f
New line                                          \n
Carriage return                                   \r
Horizontal tab                                    \t
Vertical tab                                      \v
ASCII character with octal number Octal           \Octal
ASCII character with hex number Hex               \xHex
Control-x, where x is any control character       \cx
The beginning of a line or string                 ^
The end of a line or string                       $
A word boundary                                   \b
Not a word boundary                               \B
Single whitespace character (tab, space, etc.)    \s
Single non-whitespace character                   \S
Wildcard character: anything but a new line       .

You should be familiar with what the first ten of these items will match, so to demonstrate the rest, let's take an example text:

Jimmy the Scot scooted his scooter through the park.

The Parky watched Jimmy do this

and we'll go through some easy examples:

var RE1 = /^Jimmy/;         //matches "Jimmy" at the start of line 1, not the one on line 2
var RE2 = /his$/;           //matches the "his" ending "this" on line 2, not "his" on line 1
var RE3 = /\bt/;            //matches the "t" that starts "the", "through" and "this"
                            //but not the "t" in "Scot" or "watched"
var RE4 = /\Bt/;            //matches the "t" in "Scot" or "watched"
                            //but not the "t" that starts "the" or "through"
var RE5 = /t\s./;           //matches "t s" in "Scot scooted" but nothing in "watched"
var RE6 = /t\S./;           //matches "the" and the "tch" in "watched" but not "t s"

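From these examples, you can see that each ordinary character or character-matching escape sequence between the forward slashes of a regular expression stands for a single character only. Some escape sequences, such as the whitespace escape sequence, will match more than one kind of normal character, but each still matches only one character in total. The position markers ^, $, \b and \B are the exception: they match a condition, such as the start of a string, rather than a character.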
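The six patterns can be tried against the example text with the RegExp exec() method, which returns the matched text and its position. This is only a sketch: it assumes the two lines of the example are held in one string separated by a newline, and the variable names are invented here.

```javascript
var text = "Jimmy the Scot scooted his scooter through the park.\n" +
           "The Parky watched Jimmy do this";

var startMatch = /^Jimmy/.exec(text);  // "Jimmy" at index 0
var endMatch = /his$/.exec(text);      // "his" at the very end of the string
var boundary = /\bt/.exec(text);       // the "t" that starts "the", index 6
var nonBoundary = /\Bt/.exec(text);    // the "t" that ends "Scot", index 13
var withSpace = /t\s./.exec(text);     // "t s" from "Scot scooted"
var withoutSpace = /t\S./.exec(text);  // "the", the first t + two non-spaces

// \b is a position, not a character, so only the "t" itself is matched.
console.log(boundary[0], boundary.index);       // t 6
console.log(nonBoundary[0], nonBoundary.index); // t 13
console.log(withSpace[0]);                      // "t s"
console.log(withoutSpace[0]);                   // "the"

// With the g switch, match() lists every word-initial lowercase "t".
console.log(text.match(/\bt/g).length);         // 4
```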

RE Syntax - Shortcuts and Options

If you've kept up with this discussion so far, you should have realized that this is not a very flexible system yet. Apart from the wildcard character – a period or full stop – we've yet to proffer any notation that specifies an option on single characters. For example, let's take a brand new string we want to match our regular expressions against.

dink fink link mink oink pink rink sink tink wink +ink _ink "ink

Now, suppose we wished only to match link, mink, and wink from the string above. From what we know currently, the only option we have is to create three regular expressions – one for each word. Using the wildcard in /.ink/ will match every word in the string and that's not what we want.

There is, of course, a solution. By using square brackets, [], we can specify which group of letters we want to match a certain character. So then we should use

var RE7 = /[lmw]ink/;       //matches "link", "mink" and "wink"

to get the desired matches. At the same time, we can also use 'not-this-group' [^ ] to match every other word in the string.

var RE8 = /[^lmw]ink/;      //matches "dink", "fink", "oink", "pink", etc
                            //but not "link", "mink" and "wink"

This is all very well if you've got a few characters that you'd like to check in a single space, but when you've got ten or twelve, your RE is going to start looking quite unwieldy (or should that be unwieldier?). Fortunately, we can also specify ranges of characters using the hyphen, -, within the brackets as a marker to indicate the range. For example, to match every word apart from tink and wink in the string, you could use either of:

var RE9 = /[a-s]ink/;       //matches "dink" through "sink" only
var RE10 = /[^t-z]ink/;     //also matches +ink, _ink and "ink
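The difference between these character classes is easiest to see by running them over the whole string. The sketch below adds the g switch to each pattern so that the String match() method returns every match; the variable names are illustrative.

```javascript
var words = 'dink fink link mink oink pink rink sink tink wink +ink _ink "ink';

var chosen = words.match(/[lmw]ink/g);     // only the three listed letters
var others = words.match(/[^lmw]ink/g);    // everything except those three
var aToS = words.match(/[a-s]ink/g);       // first letter in the range a to s
var notTToZ = words.match(/[^t-z]ink/g);   // first character NOT in t to z

console.log(chosen);         // [ 'link', 'mink', 'wink' ]
console.log(others.length);  // 10
console.log(aToS.length);    // 8
console.log(notTToZ.length); // 11, the same 8 plus +ink, _ink and "ink
```

Note that a negated class such as [^t-z] accepts any character outside the range, punctuation included, which is why it picks up three more matches than [a-s].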

By now you might have realized that in using a hyphen, [0-9] represents all the numerals, [a-z] the lower case letters and [A-Z] the uppercase. Again, we have some shortcuts for particular ranges of characters that make things more compact.

Shortcut  Represents the range  Matches
\d        [0-9]                 Any number character
\D        [^0-9]                Any non-number character
\w        [0-9a-zA-Z_]          Any letter, numeral or underscore
\W        [^0-9a-zA-Z_]         Any character other than a letter, numeral or underscore
\s        [ \f\n\r\t\v]         Any whitespace character
\S        [^ \f\n\r\t\v]        Any non-whitespace character. NB: not the same as \w
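A short sketch of the shortcuts in use follows. The sample string is invented for illustration, and the + repetition character used in one pattern is explained in the next section.

```javascript
var sample = "Order_66 shipped 3 items";

var digits = sample.match(/\d/g);        // each digit on its own
var wordRuns = sample.match(/\w+/g);     // runs of letters, digits, underscores
var spaces = sample.match(/\s/g);        // each whitespace character
var nonWord = sample.match(/\W/g);       // here, just the three spaces

console.log(digits);         // [ '6', '6', '3' ]
console.log(wordRuns);       // [ 'Order_66', 'shipped', '3', 'items' ]
console.log(spaces.length);  // 3
console.log(nonWord.length); // 3

// \S and \w are not interchangeable: "+" is non-whitespace but not a word character.
console.log(/\S/.test("+"), /\w/.test("+")); // true false
```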

There is one more thing to mention in this section: alternation, written with the or symbol (|). Equivalent to the or operator (remember ||?), the alternation symbol matches any one group of symbols out of several groups specified. For example, /(ab|cd)/ will match either two-character sequence. The parentheses mark out where the alternatives begin and end, in the same way that you can use parentheses in common mathematical expressions. This adds clarity and, when other characters surround the alternatives, keeps those characters out of the alternation. Let's look at some examples.

var RE11 = /^([1-9]|1[0-2]):[0-5]\d$/;      // Matches 12-hour times such as 9:45 or 12:05
var RE12 = /['"]\d\d\d['"]/;         // Matches a three digit number in quotes
var RE13 = /[123]|[^123]/;           // Matches any single character

Our last example, RE13, serves little purpose other than to illustrate the or symbol. In effect, it does the same job as the wildcard character, except that it also matches a new line.
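The following sketch exercises the time pattern and the quoted-number pattern (written with \d for each digit), and shows what happens if the parentheses around the alternatives are left out. The test strings and variable names are made up for illustration.

```javascript
var timeRE = /^([1-9]|1[0-2]):[0-5]\d$/;

console.log(timeRE.test("9:45"));   // true
console.log(timeRE.test("12:05"));  // true
console.log(timeRE.test("13:00"));  // false, hour out of range
console.log(timeRE.test("7:60"));   // false, minutes out of range

// Without parentheses the | splits the WHOLE pattern in two:
// "^[1-9]" or "1[0-2]:[0-5]\d$", so any string starting with 1-9 matches.
var looseRE = /^[1-9]|1[0-2]:[0-5]\d$/;
console.log(looseRE.test("9 lives")); // true, probably not what was intended

var quotedNumber = /['"]\d\d\d['"]/;
console.log(quotedNumber.test("'123'")); // true
console.log(quotedNumber.test("'42'"));  // false, only two digits
```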

RE Syntax - Repetition and References

There are still some limitations to what we can do with regular expressions, the most notable being that we have to write out explicitly how many characters we're searching for in our pattern. This is fine for small, well-defined patterns like example RE11 for matching time strings, but say we wanted to match any string in quotes: we would face quite a challenge to match it with a group of expressions like RE12.

The solution comes from a number of special characters used to denote the repetition of a letter, number, etc as follows.

Characters  Meaning
?           Either zero or one match only
*           Zero or more matches
+           One or more matches
{n}         Exactly n matches
{n,m}       No less than n and no more than m matches
{n,}        At least n matches

Each of these characters should come directly after the character or group that you wish to repeat, as demonstrated below.

var RE14 = /g?nash/;               //Matches gnash or nash
var RE15 = /g*nash/;               //Matches nash, gnash, ggnash, gggnash etc.
var RE16 = /g+nash/;               //Matches gnash, ggnash, etc but not nash
var RE17 = /go{2}p/;               //Matches goop, not gop or gooop
var RE18 = /go{3,}p/;              //Matches gooop, goooop, not gop or goop
var RE19 = /go{1,3}p/;             //Matches gop, goop and gooop only
var RE20 = /['"][^'"]*['"]/;       //Matches a quoted string of any length, even empty

You can also specify a sequence of characters to be repeated by enclosing them in parentheses like so.

var RE21 = /\d{5}(-\d{4})?/;         //Matches zip codes
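The repetition characters can be checked quickly with test() and exec(). This is a sketch only, with invented test strings and variable names, and it also shows a weakness of the quoted-string pattern: nothing forces the closing quote to be the same kind as the opening one.

```javascript
var optionalG = /g?nash/.test("nash");    // true, zero g's is fine for ?
var someGs = /g*nash/.test("ggnash");     // true, * allows any number
var needsG = /g+nash/.test("nash");       // false, + needs at least one g

var exactlyTwo = /go{2}p/.test("gooop");  // false, three o's is one too many
var threeOrMore = /go{3,}p/.test("goop"); // false, only two o's
var oneToThree = /go{1,3}p/.test("gop");  // true

// exec() returns the matched text as element 0 of its result.
var quoted = /['"][^'"]*['"]/.exec('He said "hello" twice')[0];
var mismatched = /['"][^'"]*['"]/.test("'oops\"");

// The parenthesized group (-\d{4}) is optional as a whole.
var shortZip = /\d{5}(-\d{4})?/.exec("Zip: 90210")[0];
var longZip = /\d{5}(-\d{4})?/.exec("Zip: 90210-1234")[0];

console.log(quoted);     // "hello" (including the double quotes)
console.log(mismatched); // true, the two quote characters need not agree
console.log(shortZip);   // 90210
console.log(longZip);    // 90210-1234
```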

The use of parentheses brings us on neatly to our final piece of syntax: reference markers. Consider the strings 212-555-212, 628-932-628, "Quote 1" and 'Quote 2'. Each of them has a (series of) character(s) repeated after some others. Regular expressions allow us one more match test on strings: to match character strings against those stored elsewhere in the same string. In effect, we can search for a matching pair of quotes or brackets and subsequently do something with the contents. For example, we can ensure a user's name matches that written on his credit card or, in the case of calculating the mathematical constant pi (π), that no single digit is repeated more than twice in a row. We do this with references.

Let's take the number example. The regular expression to match those numbers mentioned earlier would be /\d{3}-\d{3}-\d{3}/. In order to match the first group of three numbers with the last, we would alter that expression slightly to /(\d{3})-\d{3}-\1/. The new escape sequence here, \1, seeks out the first expression within a set of parentheses and finds out the match for that expression. Then it tries to match a later sequence of characters against that earlier qualifying match.

These reference escape sequences can refer to the nth set of parentheses in a regular expression, so it's not uncommon to see \2, \3, \4 etc in more complex expressions. Where the parentheses are nested, \n refers to the pair of parentheses that begins with the nth left parenthesis, counting from the left. One final word of caution here. Should you use, for example, \10 in your expression when there are not ten sets of parentheses present, JavaScript will take this as meaning the ASCII character with octal value 10, in this case a backspace. Older JavaScript engines allowed a maximum of nine such references (\1 to \9); modern engines do not impose this limit.
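Here is a small sketch of references in action, using the number and quote strings discussed above. The variable names and the nested pattern are illustrative assumptions, not part of any particular application.

```javascript
// \1 must repeat whatever the first parenthesized group matched.
var repeatRE = /(\d{3})-\d{3}-\1/;
console.log(repeatRE.test("212-555-212")); // true
console.log(repeatRE.test("628-932-628")); // true
console.log(repeatRE.test("212-555-313")); // false, 313 is not 212

// The closing quote must be the same character as the opening one.
var quoteRE = /(['"])[^'"]*\1/;
console.log(quoteRE.test('"Quote 1"'));    // true
console.log(quoteRE.test("'Quote 2'"));    // true
console.log(quoteRE.test("\"Mismatch'"));  // false

// Nested groups are numbered by their left parenthesis:
// group 1 is "1-2", group 2 is "1", group 3 is "2".
var nested = /((\d)-(\d))=\3/.exec("1-2=2");
console.log(nested[1], nested[2], nested[3]); // 1-2 1 2
```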

To conclude, regular expressions are built out of pattern matching elements that together form more sophisticated patterns. The pattern matching process takes a candidate string and matches it from left-to-right against the supplied regular expression pattern. If there is any possible match at all, the string matches. If there is no match, you can be sure there is no possible interpretation of the pattern that could work for the string. This is an important point you can rely on if the regular expression, or the string it works on, starts becoming complex.
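The left-to-right behavior can be observed directly with exec(), which reports where the match was found and returns null when no match is possible. The strings below are invented for illustration.

```javascript
var candidate = "goop and gooop";

// Both "oo" and "ooo" could satisfy o+, but the leftmost one is reported.
var found = /o+/.exec(candidate);
console.log(found[0], found.index); // oo 1

// When no part of the string can match, exec() returns null.
var missing = /z/.exec(candidate);
console.log(missing);               // null
```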