© J R Stockton, ≥ 2010-02-25

JavaScript RegExps & Validation.


See "About This JavaScript Site" in JavaScript Index and Introduction.

Regular Expressions

NOTE : This page is not intended to teach the use of Regular Expressions (RegExps) on its own; parallel access to a manual page or similar, and preferably to a well-written book or tutorial, is assumed. It uses JavaScript, and not VBScript.

RegExps are American-initiated : therefore, letters were taken to be the 52 characters A-Z a-z only. Some systems may do a little better.

The term "Regex" is sometimes used; but "RegExp" is correct for JavaScript.

RegExp Links

I would hope that Regular Expression Pocket Reference (O'Reilly)
would be recommendable, too - but I've not AFAIR seen it.
See http://oreilly.com/catalog/9780596514273/index.html.

Remember that manufacturers' sites generally describe their latest features at the time of writing. Therefore, some of the features which they describe may not yet be safe for Internet pages.

Note that some of the above are rather old, and so omit some features which are now fairly safe to use.

Uses of RegExps

Validation is a major use, but not the sole use, for RegExps.

RegExps are extremely useful in string handling; they can be used for parsing, format checking, substitution, and field extraction. They can be powerful, yet they are easy enough to use for comparatively simple tasks. They are implemented in a number of programming languages, and also in editors and in editing tools such as SED and MiniTrue.

RegExps in General

RegExps are used to match a specified pattern against the contents of a string, giving information as to what, if anything, was found.

A RegExp consists of a cabalistic sequence of characters and metacharacters. Characters A-Z a-z 0-9 and some others represent themselves. Metacharacters include ^ $ . \d \D for beginning, end, any character, decimal digit, not decimal digit. The character \ means that the following character(s) do not have their usual meaning; so \\ means the actual character \ .
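A minimal sketch of the metacharacters just named; the sample strings and variable names are illustrative assumptions only :-

```javascript
var startsWithDigit  = /^\d/.test("7 days");   // ^ anchors at the beginning
var endsWithNonDigit = /\D$/.test("7 days");   // $ anchors at the end
var anyThree         = /^...$/.test("a1#");    // . matches any character except a line terminator
var hasBackslash     = /\\/.test("C:\\TEMP");  // \\ matches one actual backslash
```

All four results should be true.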

It is commonly useful to parse a string with a RegExp as an aid to splitting it into fields. See, for example, the source-code of Date and Time 4 : Validation.

Other examples are below, and in JavaScript Tests, JavaScript/HTML/VBS Quick Trials and other pages, and in various batch files using MiniTrue (mtr).

For RegExps in VBScript, see Regular Expressions.

RegExps in JavaScript

JavaScript RegExps are Objects with Methods and Properties, and can be used as parameters for functions and for methods of other Objects. RegExps do require, I believe, JavaScript 1.2, which means that common browsers must be at least version 4.

A RegExp is created by new RegExp("~~~") or given as a literal by /~~~/ .

For the lastIndex property with the g flag, see RC in CLJ.

Note that RegExps have "hidden variables". In particular, a repeated search may not start at the beginning.
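A short sketch of that effect, assuming a RegExp with the g flag that is reused for several tests; the variable names are illustrative :-

```javascript
var re = /a/g;                // with g, the RegExp remembers lastIndex
var first  = re.test("abc");  // true; lastIndex is now 1
var second = re.test("abc");  // false; the search began at index 1, and lastIndex resets to 0
var third  = re.test("abc");  // true again
re.lastIndex = 0;             // an explicit reset before the RegExp is reused
```

Without the g flag, each test starts at the beginning of the string.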

Modifying flags i g can be applied, for case-independence and global-matching.

A user of this page will need access to a syntax reference list; a Web search for "Regular Expression Syntax" should find one of the standard lists of metacharacters.

Basic Examples

Standard Methods Using RegExp Matching

	Object	Method(arguments)	Result		Fail

	RegExp	.exec(S)		Array		null
	RegExp	.test(S)		Boolean		n/a
	String	.match(R)		Array		null
	String	.replace(SR, SF)	String		n/a
	String	.search(R)		Number		-1
	String	.split(SR)		Array		n/a

R is a RegExp, obtained if necessary by using new RegExp()
S is a String
SF is a String or a Function
SR is a String or a RegExp

The methods of String are generic, and can be transferred to other Objects.
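The methods tabulated above might be exercised as follows; S and R are as in the legend, and the sample data and result names are assumptions for illustration :-

```javascript
var S = "2010-02-25";
var R = /(\d+)-(\d+)/;

var viaExec    = R.exec(S);            // Array ["2010-02", "2010", "02"], or null on failure
var viaTest    = R.test(S);            // Boolean true
var viaMatch   = S.match(R);           // Array like that from exec (no g flag), or null
var viaReplace = S.replace(/-/g, "/"); // String "2010/02/25"
var viaSearch  = S.search(/-/);        // Number 4, or -1 on failure
var viaSplit   = S.split(/-/);         // Array ["2010", "02", "25"]
```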

Changing Features

Incoming

In a RegExp, ? can be used for a "non-greedy" match. The feature was only introduced at MSIE 5.5 and NN 6, and was thus rather unsafe on the open Internet. See RegExp Feature Testing below.
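A small sketch of the difference, with sample text of my own choosing :-

```javascript
var S = "<b>one</b> and <b>two</b>";
var greedy = S.match(/<b>.*<\/b>/)[0];   // .* takes as much as it can
var lazy   = S.match(/<b>.*?<\/b>/)[0];  // .*? takes as little as it can
```

Here greedy is the whole of S, whereas lazy is only "<b>one</b>".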

Note that look-ahead is also relatively recent.

x = /^(?!(.*abc.*))/.test("555abc888") // true only if S has no substring "abc"; false here

Outgoing

The RegExp.$1 notation is now deprecated in JavaScript; it seems that Res[1] has to be used, where Res is the result of the exec method of the RegExp. Alternatively, perhaps use Res = String.match(RegExp). In Opera 9.64 and 10.00 at least, although X = RegExp.$1 works, with (RegExp) { X = $1 } appears not to.

I have heard that in JScript.NET with ASP.NET the RegExp.$1 notation is not available.
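A minimal sketch of the Res[1] form, with an arbitrary sample string; note the check for null before indexing :-

```javascript
var Res = /(\d+)\.(\d+)/.exec("Opera 9.64");
var major = Res ? Res[1] : "";   // first parenthesised substring
var minor = Res ? Res[2] : "";   // second parenthesised substring
```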

RegExp Feature Testing

Calling for unavailable RegExp features can result in wrong results or an error message. In IE 4, for example, both writing /a??/ and without protection executing new RegExp("a??") caused an "unexpected quantifier" message. Results of executing RegExps containing unsupported features may differ among browsers; there may be no error message. See whether the FAQ of news.comp.lang.javascript, and/or its notes, yet help.

In IE4, this safely detected the lack (for later browsers only, consider try/catch) :-

<script>

function catchError() { alert('Caught') ; return true } // shown in IE4

testRe = "Before"
var onerrorSave = window.onerror
window.onerror = catchError
testRe = new RegExp("a??")	// Caught in MS IE 4, OK in IE6
window.onerror = onerrorSave
alert("1 " + testRe)		// not shown in IE4
				// shown as "1 /a??/"  in IE6

</script>

<script>

alert("2 " + testRe)		// shown as "2 Before" in IE4
				// shown as "2 /a??/"  in IE6

/* P.S. 20061204 : typeof testRe → "object" / "string"
   but start with testRe = "" and finally use OK = !!testRe */

</script>

Extended RegExp literals must not be written in code intended to work with IE4.
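For later browsers, the try/catch approach mentioned above might be sketched thus. The function name is hypothetical; the pattern must be passed as a string, since an unsupported literal would fail when the script is parsed :-

```javascript
// Hypothetical helper: true if this engine can compile the given pattern.
function regExpCompiles(source, flags) {
  try {
    new RegExp(source, flags || "");
    return true;
  } catch (e) {
    return false;   // typically a SyntaxError
  }
}

var lazyOK = regExpCompiles("a??");   // is the non-greedy quantifier available ?
```

This only shows that the pattern is accepted; it does not show that the feature is implemented correctly.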

RegExp \S \s Testing

The set of characters matching \s (and so not matching \S) is browser-dependent. Note that there may be space-like characters not recognised by \s in any tested browser.

Characters are here represented as four-digit Hex codes. The following script generates a string containing all 65536 Unicode characters from 0000 to ffff. Then it collects and reports all those which match \s. Also, \w is fully tested.
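A comparable check can be sketched as below. This is not the page's own script: it tests each character singly rather than building one long string, and the function name is an assumption :-

```javascript
// Returns the four-digit hex codes of all characters 0000..ffff which match \s.
function listWhitespace() {
  var found = [], J, hex;
  for (J = 0; J <= 0xffff; J++) {
    if (/\s/.test(String.fromCharCode(J))) {
      hex = J.toString(16);
      while (hex.length < 4) hex = "0" + hex;   // pad to four digits
      found.push(hex);
    }
  }
  return found;
}

var spaces = listWhitespace();   // the contents depend on the browser or engine
```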

My tests are with Win XP sp3.

This processes the editable green list above, reading only characters in [ ... ]
and maybe in { ... }.




    For those systems, \s matches all in "Each";
\S matches all not in "Some".

In browsers, the first six characters found are generally 0009 000a 000b 000c 000d 0020 = HT, LF, VT, FF, CR, Space. All characters, selected by number, are explained via Unicode Code Charts, using the look-up at the top, and characters are also listed alphabetically by name.

0009 = HT
000a = LF
000b = VT
000c = FF
000d = CR
0020 = space
0085 = next line
00a0 = non-breaking space
1680 = Ogham space mark
180e = Mongolian vowel separator
2000 - 200b = spaces of
  different sizes, including zero
2028 = line separator
2029 = paragraph separator
202f = narrow no-break space
205f = medium mathematical space
3000 = ideographic space
feff = zero-width no-break space

That appears to be a complete list of the 28 Unicode blank characters.

Unicode includes some non-blank characters representing spaces, and 0000.

See, at The Unicode Consortium, Unicode Character Properties, listing 26 characters as "White_Space" (page omits 200b & feff).

See also Whitespace deviations.

Testing RegExps

A common programmer fault is checking that a RegExp (or other) test matches what it should match, but failing to check that it also does not match all that it should not match.
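One way to guard against that fault is to keep two lists, one of strings which must match and one of strings which must not, and to run both. A sketch, with a hypothetical function name and sample lists :-

```javascript
var R = /^\d{1,5}$/;   // 1 to 5 digits
var shouldMatch    = ["0", "123", "99999"];
var shouldNotMatch = ["", "123456", "12a", " 12", "1.5"];

// Returns an Array describing every disagreement; empty if all is well.
function failures(re, good, bad) {
  var out = [], J;
  for (J = 0; J < good.length; J++) if (!re.test(good[J])) out.push("missed: " + good[J]);
  for (J = 0; J < bad.length; J++)  if (re.test(bad[J]))   out.push("wrongly matched: " + bad[J]);
  return out;
}

var report = failures(R, shouldMatch, shouldNotMatch);
```

Omitting the ^ and $ from R leaves the first list passing, but the second list then shows the error.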

A RegExp Tester

RegExp Tester


    M = St.match(new RegExp(RE, Fg)) ; OK = !!M

   

LF shows code for the corresponding operation with a RegExp literal, as commonly used.

A RegExp can be tested in .match() and then used in any of the methods listed above.

Some Simple RegExp Applications

Validation, as below.

Number and Date Re-Formatting

Number : in JavaScript Maths.

Date : in Date and Time 3 : Input and Lengths and Date and Time 9 : Output Formatting.

Counting Characters

To count characters of a specified type in a string, replace the rest with nothing and take the length of what is left; alternatively, use .match. To count lower-case English letters, use :-

count = msg.replace(/[^a-z]/g, "").length
count = msg.match(/[a-z]/g).length // if known that count > 0
count = (T=msg.match(/[a-z]/g)) ? T.length : 0 // otherwise

Counting Words

To count words in a string, replace each word with "a" and then each gap with nothing, then take the length of what is left : so, if \S adequately defines a word character, use :-

count = msg.replace(/\S+/g, 'a').replace(/\s+/g, '').length

Note - the definition of "word" is not altogether easy : consider "cat's-paw".

To count a specific computed substring SS :-

count = (msg.match(new RegExp("("+SS+")", "g")) || "").length

Note that innerHTML may not be exactly as expected.

To count occurrences of a word in a textarea using a RegExp and modifiers supplied as strings :-
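A sketch of such a count, without the textarea; the function name and sample text are assumptions, and the pattern is supplied as a string, so its backslashes are doubled :-

```javascript
// Hypothetical helper: text, source and flags would normally come from form controls.
function countMatches(text, source, flags) {
  if (flags.indexOf("g") < 0) flags += "g";   // without g, match() reports only the first
  var M = text.match(new RegExp(source, flags));
  return M ? M.length : 0;
}

var n = countMatches("The cat sat on the mat; the end.", "\\bthe\\b", "i");
```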




Fixing Decimal Places

S = F.X2.value
OK = /^\d{1,4}(\.\d\d?)?$/.test(S)
if (OK) S = S.replace(/(\.\d)$/, "$10").replace(/^(\d+)$/, "$1.00")

The first line gets a string; the second checks it for having 1-4 digits, optionally followed by a fractional part of one or two digits; the third, which may not be optimum, normalises it to have two decimal places. It can be tested in JavaScript/HTML/VBS Quick Trials.

Validation

General

For business purposes, where for example a Web page is a client passing information to a server, remember that client-side validation can be suborned. Client-side validation is a convenience for the honest customer, but only server-side validation can be trusted.

Ensure that such forms do not fail if JavaScript is not enabled.

Remember that users may be in different countries, and accustomed to differing formats.

Indeterminacy

It is not always practical to classify all possible input strings as being either an undoubted pass or an undoubted fail. It is then necessary to decide whether the indeterminate cases should either be treated as pass or fail or referred for user decision.

In the case of dates, for example, there is no need to accept all possible valid formats, as long as at least one valid format is indicated and the indicated formats all pass; the user can always try again until compliant.

For E-mail addresses, however, where the necessary formats may change during the life of the code, one should do no more than reject any which it is thought cannot possibly be right - blank, for example. It may be well only to warn if an address seems incorrect, allowing it to be used if the user agrees.

Efficiency

Validation for character pattern is almost always best done with a RegExp, and other validation is often easier if a RegExp is used first.

Before or after a field has been validated, the value should generally be moved to a simple local variable as an appropriate type. There is no need for repeated reference to such as document.Form1.Thingy.value or document.forms["Form1"].elements["Thingy"].value; for example, if that's known to be a number, use such as var Val = +document.Form1.Thingy.value or for a CheckBox var OK = document.Form1.ThatBox.checked.

Code should not be repetitive; so use functions with parameters to handle different fields with common code.

Radiobuttons

The only possible need is to ensure that a selection exists.

Button 0 is not displayed.   1 2 3  
Code :   !this.form.RB[0].checked :

Checkboxes

The only general need is to count selections.

0 1 2     Count :
Code :   var Q=0, J=CB.length ; while (J--) if (CB[J].checked) Q++ ;

Date/Time

Date/Time Validation is via Date and Time Introduction. It can be done by using a long and complicated RegExp, but it is far better done by using a RegExp for the pattern, then building a Date Object (likely to be wanted anyway) and checking its month to confirm that the digits are acceptable.
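A minimal sketch of that approach for the YYYY-MM-DD format, with a hypothetical function name. It relies on the Date constructor rolling out-of-range fields over into another month, and it ignores the constructor's treatment of years 0000 to 0099 as 1900 to 1999 :-

```javascript
// Hypothetical helper: true only for YYYY-MM-DD naming a real date.
function validYMD(S) {
  var M = /^(\d{4})-(\d\d)-(\d\d)$/.exec(S);
  if (!M) return false;                       // pattern check
  var D = new Date(+M[1], M[2] - 1, +M[3]);   // bad month or day rolls over
  return D.getMonth() == M[2] - 1;            // month check
}

var OK = validYMD("2010-02-25");
```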

Numeric Patterns

One should not allow a decimal point with no digit before it; nor with no digit after; such constructions are liable to be mistakes. See, for example, SUNAMCO 87-1 (IUPAP-25), 1987 revision, by Cohen & Giacomo, Section 1.3.2.; either it, or a successor, ought to be on the Web somewhere.

If a - sign is allowed, a + sign should be allowed and might be mandated.

The standard function isNaN() is frequently too general. It is often best to use a RegExp for pattern validation, at least initially, and also for separating fields; but generally not for checking numeric values. Function isFinite() may also be useful here.

RegExp examples (R) :-
  /^\d+$/                	 	All-digit
  /^[1-9]\d*$/            		All-digit, non-zero
  /^\s*[-+]\d+\s*$/       		Unbroken signed integer + spaces
  /^\d{1,5}$/             		1 to 5 digits
  /^\d+\.\d\d$/				As currency, 2 decimals
  /^\d+(\.\d{2})?$/        		As currency, 2 decimals optional
  /^[+-]?\d+(\.\d+)?(e[-+]?\d+)?$/i	Allows +3 -3.7 -3.4e+56 etc.
  not /\..*\./				Fewer than two dots
  /^\d{1,3}(,\d\d\d)*\.\d\d$/           Thou-Sep Number dot 2 decimals

Sample usage :-
  T = document.forms["FrmQ"].InputID.value // String
  OK = R.test(T) // Use a RegExp on T      // Boolean
  // If error, prompt and go back
  InputValue = + T		// Number, or
  InputValue = + RegExp.$n      // for n in 1..

The RegExp Tester above can be used.

An earlier version of this section was in JavaScript Maths.

Testing Number Formats

See also in JavaScript Maths on input of numbers.

Integers

A test such as !/\D/.test(St) will test St for being all-digit, for example - it returns (not (a-non-digit is-found in-the-string)) - but it will accept an empty string.

To test for being at least one digit and nothing else, and obtain and test the value :-
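A sketch under stated assumptions: the function name is hypothetical, and L and G are the least and greatest acceptable values.

```javascript
// Hypothetical helper: returns the Number if St is all-digit and within L..G, else null.
function readInteger(St, L, G) {
  if (!/^\d+$/.test(St)) return null;   // at least one digit, and nothing else
  var N = +St;
  return (N >= L && N <= G) ? N : null;
}

var age = readInteger("42", 0, 150);
```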

Decimals

A non-integer decimal number should have at least one digit on each side of the separator, which I here assume to be a decimal point "." - GoodDecimal() tests such; for MayHavePoint() the fractional part is optional.

Cash() should test currency formats - it requires consecutively : the-beginning one-to-three-digits (comma three-digits)-any-number-of-times dot digit digit end OR the-beginning at-least-one-digit dot digit digit end. To remove such commas, one can use .replace(/,/g, '') .
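The three functions might look something like this. The names are those used above, but the bodies are my reconstruction from the descriptions and should be treated as sketches :-

```javascript
// Sketches only: the names are from the text above, the bodies are assumed.
function GoodDecimal(S)  { return /^\d+\.\d+$/.test(S); }      // digits dot digits
function MayHavePoint(S) { return /^\d+(\.\d+)?$/.test(S); }   // fractional part optional
function Cash(S) {                                             // with or without commas
  return /^\d{1,3}(,\d{3})*\.\d\d$|^\d+\.\d\d$/.test(S);
}

var amount = +"1,234,567.89".replace(/,/g, "");   // commas removed, then converted
```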

Conversion to Number

Once shown to be acceptable, and comma-free, such a string can safely be converted to a number with the unary + operator.

Floating-Point Numbers

Here it is usually best to check by attempted conversion.

If the number must be given in floating-point notation, then test for the presence of the letter E. One can also test for the presence of a decimal point, and, if so, that it has a digit on each side; one can test for a leading sign, or one after the E. Such tests are usually unnecessary.

Assume that the input text box is F.X0 ; then the best move seems to be to try numeric conversion first. There will normally be a range of allowed values, given by a least value L and a greatest G. The following should suffice :-

x = +F.X0.value
OK = x >= L && x <= G

which, if I am not mistaken, correctly handles NaNs and Infs.
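The same test as a function, with a hypothetical name, so that the special cases can be tried. One caution: unary + converts an empty or all-blank string to 0, so a blank field passes whenever 0 lies within L..G; a prior /\S/ test excludes that.

```javascript
// Hypothetical helper; L and G are the least and greatest allowed values.
function inRange(text, L, G) {
  var x = +text;             // NaN if text is not a number
  return x >= L && x <= G;   // any comparison with NaN is false
}

var good  = inRange("3.5e2", 0, 1000);      // 350 is within range
var nan   = inRange("abc", 0, 1000);        // NaN fails both comparisons
var inf   = inRange("Infinity", 0, 1000);   // exceeds any finite G
var blank = inRange("", 0, 1000);           // passes, since +"" is 0
```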

Returning Boolean

After Bogdan Blaszczak :-

function ChkStr(RE, St, Msg) { return RE.test(St) || !!alert(Msg) }

OK = ChkStr(/^\d+(\.\d{2})?$/, "666.65", "NoGo") // currency

Currencies

Amounts of currency (in currencies with 100 minor units per major unit) should always be written either as an integer or with exactly two decimal places. The representation of the decimal point varies, and thousands separators may be allowed. Amounts smaller than 1.00 must have a zero before the decimal separator.

The basic RegExp is of the form /^\d+(\.\d\d)?$/ .

IP Addresses

function BadIP(S) { var M, J, T // e.g. "004.02.155.133"
  M = S.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/)
  if (!M) return "bad pattern"
  for (J=1; J<=4; J++) { T = +M[J]
    if (T==0 || T>254) return "bad #"+J /* Range limited */ }
  return null }

There should at least be a check against T>255. The result can be converted to Boolean with the ! operator.

Telephone Numbers

Telephone numbers are not all of the form (###) ###-#### - not even in North America, since Mexican land-lines, and mobile numbers, are different. Most telephone numbers, in fact, differ.

International standards have permitted a surprisingly large character count. On the other hand, internal telephone numbers in small countries can be rather short.

Consider mobile phones, and foreign networks, and country code; and maybe the possibility of "extra" numbers for internal routing.

For input, be adequately liberal in all respects.

For output, note that (from within the UK) London numbers are of the form 020 ABCD EFGH where A can be 3, 7, or 8. They do not start 0203 / 0207 / 0208 (a London number has 8 local digits, and its area code has 3). The best form for quoting a London number world-wide is +44 (0)20 dddd dddd .

Consider
	X = TN.replace(/\D/g, "").length
	OK = X>=4 && X<=15
where TN is the putative number and 4 is a fairly safe guess.

Combined, perhaps, with something like
	OK = /^[0-9()+ -]*$/.test(TN)
if there is a need to accept only reasonable punctuation.

I have read in <48d7d416$0$30969$426a74cc@news.free.fr> :
For information, E164 strings are defined thus :
  <pattern value="(\+[0-9]{1,3}\.[0-9]{1,14})?"/>
  <maxLength value="17"/>

For detailed advice, try reading Telephone number and ITU recommendations.

Alphanumeric Patterns

  /\s/ 		               	 	Has any whitespace
  /\S/ 		               	 	Has any non-whitespace
  /\w/ 		               	 	Has any in A-Za-z0-9_
  /\W/ 		               	 	Has any not A-Za-z0-9_

Passwords

To read a password, use an input field of type password.

Passwords should not be transmitted en clair over insecure links, nor stored in a manner from which the original can be deduced. When a password is entered, it should immediately be converted with a one-way function, and it is the result of that which should be stored for comparison.

Client-side password checking is generally insecure, but proposed passwords can properly be assessed client-side.

A password should be of reasonable length and free of problematic characters (spacing, accented characters, punctuation). Often it is stipulated that it should contain at least one digit and at least one letter, perhaps at least one letter of each case. Note that \w accepts underline.

For assessing a proposed new password, consider :-

x1 = /^[a-z\d]{6,10}$/i // only alphanumerics, and length 6-10
x2 = /[a-z]/i           // a letter present
x3 = /\d/               // a digit present
OK = x1.test(w) && x2.test(w) && x3.test(w)

It is generally a mistake to attempt to do the full check in a single RegExp.

E-Mail Addresses

There is no need for full validation when an E-mail address is created. Much of it will be standard, and the rest can be more restricted than the RFCs allow. When a Web form calls for an E-mail address, partial validation is possible and reasonable.

See Wikipedia E-mail address.

Full Validation is Not Practical

One cannot in a Web page fully validate an E-mail address; for proof of this, it is sufficient to observe that, with dial-up Internet machines, it is possible for an actual address to become valid or invalid by a change in data on a machine which is currently isolated from the Net.

Moreover, the set of allowable top-level domain names (TLDs) is subject to change.

It is possible to use a complex RegExp method to test validity against any given format. For this to be useful, it must give results in exact agreement with all applicable RFCs and practices, and must be updated whenever a change to these is implemented. That is, except pedagogically, rather a pointless exercise for most people.

Testing for Plausibility

It does seem useful to test whether a supplied E-address is something like reasonable; to make a check, for example, which would reject an attempt to supply a personal name, a postal address, a telephone number, a mere nickname, or nothing at all. For this, it should suffice to check for a match to 'something @ something . something', with little or no concern for the 'something' strings (which may themselves contain dots or spaces).

Consider the effect of   OK = /^.+@.+\..+$/.test(S)   for example; or the same without ^ or $ or with added \b. Consider also checking that no obviously invalid characters are present - using \S+ rather than .+ . The final field might be put as \w{2,} .
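A sketch of the \S+ variant, with a hypothetical function name and invented sample inputs :-

```javascript
// Hypothetical helper: something @ something . something, with no whitespace.
function plausibleEmail(S) { return /^\S+@\S+\.\S+$/.test(S); }

var a = plausibleEmail("someone@example.com");   // plausible
var b = plausibleEmail("Fred Bloggs");           // a personal name
var c = plausibleEmail("020 7946 0000");         // a telephone number
var d = plausibleEmail("");                      // nothing at all
```

Only the first of those should pass.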

LRN has written :-
Actually, since the adding of IPv6, this would be a valid mailbox address format:
  someone@[IPv6:ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff]
(if I read RFC 2821 correctly). There are no dots after the at-sign. There will be a colon, though, so perhaps looking for   /.+@.+[.:].+/ would be sufficient for now.

It may also be worth checking that all characters are legal, since that might help a bad typist.

E-mail addresses are generally used in a case-independent manner; so checking should be case-independent.

An earlier version of this section was in JavaScript General.

E-Mail Addresses with Comments

IMHO, it is worth validating with /.+@.+\..+/ - something AT something DOT something, or with /(\S+@\S+\.\S+)/ , to ensure that something like an Internet-style E-mail address is present, as opposed to <empty> or some other data.

A good implementation will allow commented E-mail addresses; there are those who use something like familyname@service.invalid , so it can be really useful to insert Fred <familyname@service.invalid> in the field, if it will be honoured for a return message. There may be no very good implementations in use; but the RegExps above should allow it.

As well as 'X@y.z', addresses of the general forms indicated by 'Name Q Name <X@y.z>', '"Name Q. Name" <X@y.z>' and 'X@y.z (Names)' should be acceptable.

If that RegExp is used, or, rather, /<?(\S+@\S+\.\S+)>?/ , then it is the recognised part - RegExp.$1 or whatever is more compatible (Res[1]) - which may be tested for improper characters. More thought is needed if there is to be a check that any comment is RFC-compliant.
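A sketch of extracting the recognised part by means of Res[1]. The function name is hypothetical. Note that this sketch substitutes [^\s<>] for \S, which is my own change: with \S, the final \S+ would swallow the closing > of a commented address.

```javascript
// Hypothetical helper: returns the address-like part of a field, or null.
function extractAddress(S) {
  var Res = /<?([^\s<>]+@[^\s<>]+\.[^\s<>]+)>?/.exec(S);
  return Res ? Res[1] : null;
}

var A1 = extractAddress("Fred <familyname@service.invalid>");
var A2 = extractAddress("X@y.z (Names)");
var A3 = extractAddress("no address here");
```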

Testing

Use the RegExp Tester above to try those and other expressions.

Note that the left part, at least, of a deliverable address can include the characters $ _ | (dollar, underscore, vertical-bar).

DOS/Windows File Names

In draft. Needs testing.

The general form is something like

 Field				RegExp piece
 Quote			?	("?)
 Drive-letter Colon	?	([a-zA-Z]:)?
 Backslash		?	\\?
 Directories		?	(\w+(\.\w*)?\\)*
 File				\w+(\.\w*)?
 Matching quote			\1

 Note:	I've put \w to represent allowed general characters;
	that is too restrictive, and needs to be changed.
	And I've not yet allowed for dot and multi-dot.
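Assembled into one anchored RegExp, the pieces might be tried as below. This is a sketch in the same draft spirit: it takes each directory to end with an escaped backslash, and keeps \w as the placeholder for allowed characters, so names with spaces are rejected.

```javascript
// Pieces in order: quote, drive, backslash, directories, file, matching quote.
var FileName = /^("?)([a-zA-Z]:)?\\?(\w+(\.\w*)?\\)*\w+(\.\w*)?\1$/;

var F1 = FileName.test('C:\\dir\\file.txt');      // plain full path
var F2 = FileName.test('"C:\\dir\\file.txt"');    // quoted at both ends
var F3 = FileName.test('"C:\\dir\\file.txt');     // quote not matched
var F4 = FileName.test('C:\\My Docs\\a.txt');     // space is not in \w
```

F1 and F2 should be true, F3 and F4 false.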

For the allowed general characters, there is a choice. To access any existing file, one should exclude only those characters which cannot possibly be present. But for file creation, it is possible instead to allow only pleasing characters; I often allow myself only [a-zA-Z0-9-] for maximum compatibility.

Validation Links

See also :-

A Scheme for Validation of the Fields of Forms

This is the usual sort of code :-

   f = element
   if ( f.value fails ) {  
     alert("Error ...")
     f.focus()
     return false }

usually repeated in-line for every field, with each test spelt out in full.


N.B. For efficiency, f should be evaluated only once per element.

The code for validation of simple, but multiple or non-short, Forms can be greatly condensed in comparison with the usual lengthy approach.

The actions needed in conjunction with validation of each of the fields will be similar, so one can use an Object to define each field to be tested and how it is to be handled, and then process an Array of those for each Form with a general, form-independent, encapsulating function. An editable example Object is given below the following test Form.

If a Form has, or multiple Forms have, fields needing identical validation, a single Object can be used in each case, rather than copying code.

If multiple Forms need identical validation, one could alter the code so that, as well as processing the Form, if any, named in the Object, subsequent parameters to the validation function would identify the Forms.

The scheme handles both textual and non-textual controls.

Error reports can have fixed and variable parts. The exact wording may need careful crafting, dependent on the circumstances, so that the result always reads well enough.

The technique should be adapted according to circumstances.

Text Pattern Validation by Regular Expression

Almost all textual fields can be usefully validated, at least initially, with a RegExp. For example, /^$|^\d{3}$/ will test for a field either being empty or having three digits.

For an accept-anything RegExp, // is not usable, but /^/ should be. However, one can omit the field from the Object in the array of tests.

For a field to be non-empty, the RegExp is /./ ; one may prefer /^\S+$/ - no blanks and at least one visible character - or /^\S/ - no blanks before at least one visible character.

This test is invoked by the optional element R .

Field Validation by Function Call

RegExps are not well suited to checking such things as a general numeric range or date field values.

An optional element V names a function to test the form element in more detail. The Age field is thus tested, after being pattern-checked. The function supplied by V can test both textual controls and non-textual controls such as Radiobuttons and Checkboxes. Its first parameter is the Field, and a second parameter given by optional element P can be used as required, as with AgeOK .

Such a function can take account of the values of previously-validated elements; but it should be better to handle interdependencies independently.

New optional parameters could be added for tests of other natures; for example, one reporting "that looks peculiar but is not necessarily wrong - confirm?".
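The scheme might be sketched as follows. Elements R, V and P and the name AgeOK follow the text above; everything else (N for the field name, M for the message, ValidateAll, and the plain Object standing in for a Form) is assumed for illustration and is not this page's actual code.

```javascript
function AgeOK(Field, P) {            // P = [least, greatest]
  var N = +Field.value;
  return N >= P[0] && N <= P[1];
}

var Tests = [                         // one Object per field to be tested
  { N: "Name", R: /\S/,        M: "Name must not be blank" },
  { N: "Age",  R: /^\d{1,3}$/, V: AgeOK, P: [18, 120], M: "Age must be 18 to 120" }
];

// Returns an Array of messages; Form is any Object whose named members have .value
function ValidateAll(Form, Tests) {
  var Errors = [], J, T, Field;
  for (J = 0; J < Tests.length; J++) {
    T = Tests[J];
    Field = Form[T.N];                // evaluated once per element
    if (T.R && !T.R.test(Field.value)) { Errors.push(T.M); continue; }
    if (T.V && !T.V(Field, T.P)) Errors.push(T.M);
  }
  return Errors;
}

var Fake = { Name: { value: "Fred" }, Age: { value: "17" } };   // stands in for a real Form
var Problems = ValidateAll(Fake, Tests);
```

Here the pattern test passes for both fields, and only the AgeOK range test reports an error.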

A Test Form

The form on the left below is by default correctly filled in; try the test buttons, then try the effect of errors in its fields (here, visibly blank is always an error). The test parameters are defined by functions merely for convenient display; they would normally be supplied to the test functions as variables or literals.

For actual use, take the checkbox decision at design time, remove unnecessary code branches, and redesign error wording style.

The form's action="#" placates W3's Tidy; adjust it at need.

Generalised Form Testing
The Form ("FrmX") :







Editable test-defining Object for FTry4 - similar to PrepObj but different age-range :-

The Code

On pressing a button, calling FTry#() provides the results.

© Dr J R Stockton, near London, UK.
All Rights Reserved.
These pages are tested mainly with Firefox 3.0 and W3's Tidy.
This site, http://www.merlyn.demon.co.uk/, is maintained by me.