See "About This JavaScript Site" in JavaScript Index and Introduction.
NOTE : This page is not intended to teach the use of Regular Expressions (RegExps) on its own; parallel access to a manual page or similar, and preferably to a well-written book or tutorial, is assumed. It uses JavaScript, and not VBScript.
RegExps are American-initiated : therefore, letters were taken to be the 52 characters A-Z a-z only. Some systems may do a little better.
The term "Regex" is sometimes used; but "RegExp" is correct for JavaScript.
I would hope that Regular Expression Pocket Reference (O'Reilly) would be recommendable, too - but I've not AFAIR seen it. See http://oreilly.com/catalog/9780596514273/index.html.
Remember that manufacturers' sites generally describe their latest features at the time of writing. Therefore, some of the features which they describe may not yet be safe for Internet pages.
Note that some of the above are rather old, and so omit some features which are now fairly safe to use.
Validation is a major use, but not the sole use, for RegExps.
RegExps are extremely useful in string handling; they can be used for parsing, format checking, substitution, and field extraction. They can be powerful, yet they are easy enough to use for comparatively simple tasks. They are implemented in a number of programming languages, and also in editors and in editing tools such as SED and MiniTrue.
RegExps are used to match a specified pattern against the contents of a string, giving information as to what, if anything, was found.
A RegExp consists of a cabalistic sequence of characters and metacharacters. Characters A-Z a-z 0-9 and some others represent themselves. Metacharacters include ^ $ . \d \D for beginning, end, any character, decimal digit, not decimal digit. The character \ means that the following character(s) do not have their usual meaning; so \\ means the actual character \ .
It is commonly useful to parse a string with a RegExp as an aid to splitting it into fields. See, for example, the source-code of Date and Time 4 : Validation.
Other examples are below, and in JavaScript Tests, JavaScript/HTML/VBS Quick Trials and other pages, and in various batch files using MiniTrue (mtr).
For RegExps in VBScript, see Regular Expressions.
JavaScript RegExps are Objects with Methods and Properties, and can be used as parameters for functions and for methods of other Objects. RegExps do require, I believe, JavaScript 1.2, which means that common browsers must be at least version 4.
A RegExp is created by new RegExp("~~~") or given as a literal by /~~~/ .
Note that RegExps have "hidden variables". In particular, a repeated search may not start at the beginning.
Modifying flags i g can be applied, for case-independence and global-matching.
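As a minimal sketch of the two creation forms and of the "hidden variable" behaviour (the variable names are illustrative only, not taken from this page): a RegExp carrying the g flag keeps a lastIndex property, so a repeated test or exec resumes where the previous match ended.

```javascript
// Two equivalent ways to create the same RegExp.
var reLiteral = /ab+c/i;
var reObject  = new RegExp("ab+c", "i");  // in a string, any backslash must be doubled

// With the g flag, the RegExp remembers where the last match ended.
var reGlobal = /a/g;
var first  = reGlobal.test("banana");  // true; the "a" at index 1 is found
var after1 = reGlobal.lastIndex;       // 2
var second = reGlobal.test("ba");      // false: the search started at index 2
var after2 = reGlobal.lastIndex;       // 0, reset after the failure
```

Setting lastIndex to 0 before re-using a global RegExp on a new string, or omitting the g flag where it is not needed, should avoid the surprise.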
A user of this page will need access to a syntax reference list; a Web search for "Regular Expression Syntax" should find one of the standard lists of metacharacters.
Object | Method(arguments) | Result | Fail
---|---|---|---
RegExp | .exec(S) | Array | null
RegExp | .test(S) | Boolean | n/a
String | .match(R) | Array | null
String | .replace(SR, SF) | String | n/a
String | .search(R) | Number | -1
String | .split(SR) | Array | n/a

R is a RegExp, obtained if necessary by using new RegExp()
S is a String
SF is a String or a Function
SR is a String or a RegExp
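The six methods can be tried together on one string. The following is only an illustrative sketch; the sample string and the variable names are assumptions chosen for the demonstration.

```javascript
var S = "2024-03-15";
var R = /(\d+)-(\d+)-(\d+)/;   // non-global, so no lastIndex side effects

var viaExec    = R.exec(S);                   // Array: ["2024-03-15", "2024", "03", "15"]
var viaTest    = R.test(S);                   // Boolean: true
var viaMatch   = S.match(R);                  // Array, as exec for a non-global RegExp
var viaReplace = S.replace(R, "$3/$2/$1");    // String: "15/03/2024"
var viaSearch  = S.search(/-/);               // Number: 4, index of the first match
var viaSplit   = S.split(/-/);                // Array: ["2024", "03", "15"]

var failExec   = R.exec("no date");           // null
var failSearch = "no date".search(R);         // -1
```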
The methods of String are generic, and can be transferred to other Objects.
In a RegExp, ? placed after a quantifier (as in *? or +?) requests a "non-greedy" match. The feature was only introduced at MSIE 5.5 and NN 6, and was thus rather unsafe on the open Internet. See RegExp Feature Testing below.
Note that look-ahead is also relatively recent.
x = /^(?!(.*abc.*))/.test("555abc888") // S has no substring "abc"
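A short sketch may make the difference between greedy and non-greedy matching, and the effect of the negative look-ahead, easier to see. The sample text and variable names here are assumptions for illustration.

```javascript
var text = "<b>one</b><b>two</b>";
var greedy = text.match(/<b>.*<\/b>/)[0];    // "<b>one</b><b>two</b>" : as much as possible
var lazy   = text.match(/<b>.*?<\/b>/)[0];   // "<b>one</b>" : as little as possible

// Negative look-ahead: true only when "abc" appears nowhere in the string.
var noAbc  = /^(?!.*abc)/;
var hasIt  = noAbc.test("555abc888");   // false
var hasNot = noAbc.test("555ab888");    // true
```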
The RegExp.$1 notation is now deprecated in JavaScript; it seems that Res[1] has to be used, where Res is the result of the exec method of the RegExp. Alternatively, perhaps use Res = String.match(RegExp). In Opera 9.64 and 10.00 at least, although X = RegExp.$1 works, with (RegExp) { X = $1 } appears not to.
I have heard that in JScript.NET with ASP.NET the RegExp.$1 notation is not available.
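A minimal sketch of the Res[1] approach follows; the sample string and the names major and minor are illustrative assumptions. The result must be checked for null before it is indexed.

```javascript
var Res = /(\d+)\.(\d+)/.exec("version 5.12 beta");
var major = Res ? Res[1] : null;   // "5"  : first parenthesised group
var minor = Res ? Res[2] : null;   // "12" : second parenthesised group

// The String method gives the same array when the RegExp is not global.
var M = "version 5.12 beta".match(/(\d+)\.(\d+)/);
```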
Calling for unavailable RegExp features can result in wrong results or an error message. In IE 4, for example, both writing /a??/ and without protection executing new RegExp("a??") caused an "unexpected quantifier" message. Results of executing RegExps containing unsupported features may differ among browsers; there may be no error message. See whether the FAQ of news.comp.lang.javascript, and/or its notes, yet help.
In IE4, this safely detected the lack (for later browsers only, consider try/catch) :-
<script>
function catchError() { alert('Caught') ; return true } // shown in IE4
testRe = "Before"
var onerrorSave = window.onerror
window.onerror = catchError
testRe = new RegExp("a??") // Caught in MS IE 4, OK in IE6
window.onerror = onerrorSave
alert("1 " + testRe) // not shown in IE4
                     // shown as "1 /a??/" in IE6
</script>
<script>
alert("2 " + testRe) // shown as "2 Before" in IE4
                     // shown as "2 /a??/" in IE6
/* P.S. 20061204 : typeof testRe → "object" / "string"
   but start with testRe = "" and finally use OK = !!testRe */
</script>
Extended RegExp literals must not be written in code intended to work with IE4.
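For browsers which have try/catch, the feature test might be sketched as follows. The function name tryRegExp is hypothetical; the pattern is supplied as a string because an unsupported literal would stop the whole script from compiling.

```javascript
// Returns a RegExp, or null if the engine rejects the pattern or the flags.
function tryRegExp(pattern, flags) {
  try { return new RegExp(pattern, flags || ""); }
  catch (e) { return null; }
}

var lazyOK = !!tryRegExp("a??");   // true where non-greedy matching is supported
var broken = tryRegExp("a(");      // null: an unterminated group is a syntax error
```

This only shows that the pattern compiles; it does not show that the engine gives the right results, so a test match on known data may still be wise.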
The set of characters matching \s (and so not matching \S) is browser-dependent. Note that there may be space-like characters not recognised by \s in any tested browser.
Characters are here represented as four-digit Hex codes. The following script generates a string containing all 65536 Unicode characters from 0000 to ffff. Then it collects and reports all those which match \s. Also, \w is fully tested.
In browsers, the first six characters found are generally 0009 000a 000b 000c 000d 0020 = HT, LF, VT, FF, CR, Space. All characters, selected by number, are explained via Unicode Code Charts, using the look-up at the top, and characters are also listed alphabetically by name.
0009 = HT
000a = LF
000b = VT
000c = FF
000d = CR
0020 = space
0085 = next line
00a0 = non-breaking space
1680 = Ogham space mark
180e = Mongolian vowel separator
2000 - 200b = spaces of different sizes, including zero
2028 = line separator
2029 = paragraph separator
202f = narrow no-break space
205f = medium mathematical space
3000 = ideographic space
feff = zero-width no-break space
That appears to be a complete list of the 28 Unicode blank characters.
Unicode includes some non-blank characters representing spaces, and 0000.
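The page's own script is not reproduced in this text. A sketch of the same idea, with the hypothetical helper names hex4 and codesMatching, could be:

```javascript
// Four-digit lower-case hex for a code from 0 to 0xffff.
function hex4(n) { return ("000" + n.toString(16)).slice(-4); }

// Try every code unit 0000..ffff against re, which must not carry the g flag.
function codesMatching(re) {
  var found = [];
  for (var c = 0; c <= 0xffff; c++) {
    if (re.test(String.fromCharCode(c))) found.push(hex4(c));
  }
  return found;
}

var spaces    = codesMatching(/\s/);          // begins 0009 000a 000b 000c 000d 0020
var wordCount = codesMatching(/\w/).length;   // 63 : A-Z a-z 0-9 and underscore
```

The later entries in spaces depend on the engine, which is the point of the test.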
See, at The Unicode Consortium, Unicode Character Properties, listing 26 characters as "White_Space" (page omits 200b & feff).
See also Whitespace deviations.
A common programmer fault is checking that a RegExp (or other) test matches what it should match, but failing to check that it also does not match all that it should not match.
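One way to guard against that fault is to keep two lists, one of strings which must match and one of strings which must not, and to run both. The helper name checkRegExp and the sample data below are assumptions for illustration.

```javascript
// Returns a list of faults; an empty list means the RegExp behaved as intended.
// re must not carry the g flag, or lastIndex would upset repeated tests.
function checkRegExp(re, shouldMatch, shouldNotMatch) {
  var faults = [], i;
  for (i = 0; i < shouldMatch.length; i++)
    if (!re.test(shouldMatch[i])) faults.push("missed: " + shouldMatch[i]);
  for (i = 0; i < shouldNotMatch.length; i++)
    if (re.test(shouldNotMatch[i])) faults.push("wrongly matched: " + shouldNotMatch[i]);
  return faults;
}

var loose  = checkRegExp(/\d+/,   ["7", "42"], ["", "4x", "x4"]);  // two faults
var strict = checkRegExp(/^\d+$/, ["7", "42"], ["", "4x", "x4"]);  // no faults
```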
LF shows code for the corresponding operation with a RegExp literal, as commonly used.
A RegExp can be tested in .match() and then used in any of the methods listed above.
Validation, as below.
Number : in JavaScript Maths.
Date : in Date and Time 3 : Input and Lengths and Date and Time 9 : Output Formatting.
To count characters of a specified type in a string, replace the rest with nothing and take the length of what is left; alternatively, use .match. To count lower-case English letters, use :-
count = msg.replace(/[^a-z]/g, "").length
count = msg.match(/[a-z]/g).length // if known that count > 0
count = (T=msg.match(/[a-z]/g)) ? T.length : 0 // otherwise
To count words in a string, replace each word with "a" and then each gap with nothing, then take the length of what is left : so, if \S adequately defines a word character, use :-
count = msg.replace(/\S+/g, 'a').replace(/\s+/g, '').length
Note - the definition of "word" is not altogether easy : consider "cat's-paw".
To count a specific computed substring SS :-
count = (msg.match(new RegExp("("+SS+")", "g")) || "").length
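If SS may contain RegExp metacharacters, such as a dot or a plus sign, it presumably needs escaping before it is compiled. The following sketch uses two hypothetical helpers, escapeForRegExp and countSubstring, which are not part of this page's code.

```javascript
// Put a backslash before each character that is special in a RegExp.
function escapeForRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count non-overlapping occurrences of the literal string SS in msg.
function countSubstring(msg, SS) {
  return (msg.match(new RegExp(escapeForRegExp(SS), "g")) || "").length;
}

var n1 = countSubstring("a.b.c axb", ".");           // 2 ; unescaped, "." would match all 9
var n2 = countSubstring("1+1=2, 1+1+1=3", "1+1");    // 2
```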
Note that innerHTML may not be exactly as expected.
To count occurrences of a word in a textarea using a RegExp and modifiers supplied as strings :-
S = F.X2.value
OK = /^\d{1,4}(\.\d\d?)?$/.test(S)
if (OK) S = S.replace(/(\.\d)$/, "$10").replace(/^(\d+)$/, "$1.00")
The first line gets a string; the second checks it for having 1-4 digits, optionally followed by a fractional part of one or two digits; the third, which may not be optimum, normalises it to have two decimal places. It can be tested in JavaScript/HTML/VBS Quick Trials.
For business purposes, where for example a Web page is a client passing information to a server, remember that client-side validation can be suborned. Client-side validation is a convenience for the honest customer, but only server-side validation can be trusted.
Ensure that such forms do not fail if JavaScript is not enabled.
Remember that users may be in different countries, and accustomed to differing formats.
It is not always practical to classify all possible input strings as being either an undoubted pass or an undoubted fail. It is then necessary to decide whether the indeterminate cases should be treated as a pass, treated as a fail, or referred for user decision.
In the case of dates, for example, there is no need to accept all possible valid formats, as long as at least one valid format is indicated and the indicated formats all pass; the user can always try again until compliant.
For E-mail addresses, however, where the necessary formats may change during the life of the code, one should do no more than reject any which it is thought cannot possibly be right - blank, for example. It may be well only to warn if an address seems incorrect, allowing it to be used if the user agrees.
Validation for character pattern is almost always best done with a RegExp, and other validation is often easier if a RegExp is used first.
Before or after a field has been validated, the value should generally be moved to a simple local variable as an appropriate type. There is no need for repeated reference to such as document.Form1.Thingy.value or document.forms["Form1"].elements["Thingy"].value; for example, if that's known to be a number, use such as var Val = +document.Form1.Thingy.value or for a CheckBox var OK = document.Form1.ThatBox.checked.
Code should not be repetitive; so use functions with parameters to handle different fields with common code.
The only possible need is to ensure that a selection exists.
The only general need is to count selections.
Date/Time Validation is via Date and Time Introduction. It can be done by using a long and complicated RegExp, but it is far better done by using a RegExp for pattern, then a Date Object (likely to be wanted anyway) with a month check to confirm that the digits are acceptable.
One should not allow a decimal point with no digit before it; nor with no digit after; such constructions are liable to be mistakes. See, for example, SUNAMCO 87-1 (IUPAP-25), 1987 revision, by Cohen & Giacomo, Section 1.3.2.; either it, or a successor, ought to be on the Web somewhere.
If a - sign is allowed, a + sign should be allowed and might be mandated.
The standard function isNaN() is frequently too general. It is often best to use a RegExp for pattern validation, at least initially, and also for separating fields; but generally not for checking numeric values. Function isFinite() may also be useful here.
RegExp examples (R) :-

RegExp | Accepts
---|---
/^\d+$/ | All-digit
/^[1-9]\d*$/ | All-digit, non-zero
/^\s*[-+]\d+\s*$/ | Unbroken signed integer + spaces
/^\d{1,5}$/ | 1 to 5 digits
/^\d+\.\d\d$/ | As currency, 2 decimals
/^\d+(\.\d{2})?$/ | As currency, 2 decimals optional
/^[+-]?\d+(\.\d+)?(e[-+]?\d+)?$/i | Allows +3 -3.7 -3.4e+56 etc.
not /\..*\./ | Fewer than two dots
/^\d{1,3}(,\d\d\d)*\.\d\d$/ | Thou-Sep Number dot 2 decimals

Sample usage :-
T = document.forms["FrmQ"].InputID.value // String
OK = R.test(T) // Use a RegExp on T // Boolean
// If error, prompt and go back
InputValue = + T // Number, or
InputValue = + RegExp.$n // for n in 1..
The RegExp Tester above can be used.
An earlier version of this section was in JavaScript Maths.
See also in JavaScript Maths on input of numbers.
will test St for being all-digit, for example - it returns (not (a-non-digit is-found in-the-string)) - but it will accept an empty string.
To test for being at least one digit and nothing else, and obtain and test the value :-
A non-integer decimal number should have at least one digit on each side of the separator, which I here assume to be a decimal point "." - GoodDecimal() tests such; for MayHavePoint() the fractional part is optional.
Cash() should test currency formats - it requires consecutively : the-beginning one-to-three-digits (comma three-digits)-any-number-of-times dot digit digit end OR the-beginning at-least-one-digit dot digit digit end. To remove such commas, one can use .replace(/,/g, '') .
Once shown to be acceptable, and comma-free, such a string can safely be converted to a number with the unary + operator.
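The bodies of these functions are not reproduced in this text. The following is a sketch written from the descriptions above, not the page's own code; AllDigit and SomeDigits are hypothetical names, while GoodDecimal, MayHavePoint and Cash follow the names used in the text.

```javascript
function AllDigit(St)     { return !/\D/.test(St); }           // no non-digit found; accepts ""
function SomeDigits(St)   { return /^\d+$/.test(St); }         // at least one digit, nothing else
function GoodDecimal(St)  { return /^\d+\.\d+$/.test(St); }    // digits, point, digits
function MayHavePoint(St) { return /^\d+(\.\d+)?$/.test(St); } // fractional part optional

// Grouped in threes with commas, or ungrouped; always dot and two digits.
function Cash(St) {
  return /^\d{1,3}(,\d{3})*\.\d\d$|^\d+\.\d\d$/.test(St);
}

// Once accepted, strip the commas and convert with unary + .
var text   = "1,234,567.89";
var amount = Cash(text) ? +text.replace(/,/g, "") : NaN;   // 1234567.89
```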
Here it is usually best to check by attempted conversion.
If the number must be given in floating-point notation, then test for the presence of the letter E. One can also test for the presence of a decimal point, and, if so, that it has a digit on each side; one can test for a leading sign, or one after the E. Such tests are usually unnecessary.
Assume that the input text box is F.X0 ; then the best move seems to be to try numeric conversion first. There will normally be a range of allowed values, given by a least value L and a greatest G. The following should suffice :-
x = +F.X0.value
OK = x >= L && x <= G
which, if I am not mistaken, correctly handles NaNs and Infs.
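The claim can be checked with a small sketch; the function name inRange is hypothetical, and a plain string stands in for F.X0.value.

```javascript
// Convert first, then range-check; NaN fails because every comparison with NaN is false.
function inRange(text, L, G) {
  var x = +text;
  return x >= L && x <= G;
}

var ok       = inRange("5", 1, 10);          // true
var notNum   = inRange("abc", 1, 10);        // false : +"abc" is NaN
var infinite = inRange("Infinity", 1, 10);   // false : above G
var blank    = inRange("", 0, 10);           // true  : +"" is 0
```

One point to watch: unary + converts an empty or all-blank string to 0, so where 0 lies within L to G a pattern check such as /\S/ is still wanted first.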
After Bogdan Blaszczak :-
function ChkStr(RE, St, Msg) { return RE.test(St) || !!alert(Msg) }
OK = ChkStr(/^\d+(\.\d{2})?$/, "666.65", "NoGo") // currency
Amounts of currency (in currencies with 100 minor units per major unit) should always be written either as an integer or with exactly two decimal places. The representation of the decimal point varies, and thousands separators may be allowed. Amounts smaller than 1.00 must have a zero before the decimal separator.
The basic RegExp is of the form /^\d+(\.\d\d)?$/ .
function BadIP(S) { var M, J, T // e.g. "004.02.155.133"
  M = S.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/)
  if (!M) return "bad pattern"
  for (J=1; J<=4; J++) { T = +M[J]
    if (T==0 || T>254) return "bad #"+J /* Range limited */ }
  return null }
There should at least be a check against T>255. The result can be converted to Boolean with the ! operator.
Telephone numbers are not all of the form (###) ###-#### - not even in North America, since Mexican land-lines, and mobile numbers, are different. Most telephone numbers, in fact, differ.
International standards have permitted a surprisingly large character count. On the other hand, internal telephone numbers in small countries can be rather short.
Consider mobile phones, and foreign networks, and country code; and maybe the possibility of "extra" numbers for internal routing.
For input, be adequately liberal in all respects.
For output, note that (from within the UK) London numbers are of the form 020 ABCD EFGH where A can be 3, 7, or 8. They do not start 0203 / 0207 / 0208 (a London number has 8 local digits, and its area code has 3). The best form for quoting a London number world-wide is +44 (0)20 dddd dddd .
Consider
X = TN.replace(/\D/g, "").length
OK = X>=4 && X<=15
where TN is the putative number and 4 is a fairly safe guess. This can be combined, perhaps, with something like
OK = /^[0-9()+ -]*$/.test(TN)
if there is a need to accept only reasonable punctuation.
I read in <48d7d416$0$30969$426a74cc@news.free.fr> : For information, E164 strings are defined thus : <pattern value="(\+[0-9]{1,3}\.[0-9]{1,14})?"/> <maxLength value="17"/>
For detailed advice, try reading Telephone number and ITU recommendations.
RegExp | Tests that the string
---|---
/\s/ | Has any whitespace
/\S/ | Has any non-whitespace
/\w/ | Has any in A-Za-z0-9_
/\W/ | Has any not A-Za-z0-9_
To read a password, use an input field of type password.
Passwords should not be transmitted en clair over insecure links, nor stored in a manner from which the original can be deduced. When a password is entered, it should immediately be converted with a one-way function, and it is the result of that which should be stored for comparison.
Client-side password checking is generally insecure, but proposed passwords can properly be assessed client-side.
A password should be of reasonable length and free of problematic characters (spacing, accented characters, punctuation). Often it is stipulated that it should contain at least one digit and at least one letter, perhaps at least one letter of each case. Note that \w accepts underline.
For assessing a proposed new password, consider :-
x1 = /^[a-z\d]{6,10}$/i // only alphanumerics, and length 6-10
x2 = /[a-z]/i // a letter present
x3 = /\d/ // a digit present
OK = x1.test(w) && x2.test(w) && x3.test(w)
It is generally a mistake to attempt to do the full check in a single RegExp.
There is no need for full validation when an E-mail address is created. Much of it will be standard, and the rest can be more restricted than the RFCs allow. When a Web form calls for an E-mail address, partial validation is possible and reasonable.
See Wikipedia E-mail address.
One cannot in a Web page fully validate an E-mail address; for proof of this, it is sufficient to observe that, with dial-up Internet machines, it is possible for an actual address to become valid or invalid by a change in data on a machine which is currently isolated from the Net.
Moreover, the set of allowable top-level domain names (TLDs) is subject to change.
It is possible to use a complex RegExp method to test validity against any given format. For this to be useful, it must give results in exact agreement with all applicable RFCs and practices, and must be updated whenever a change to these is implemented. That is, except pedagogically, rather a pointless exercise for most people.
It does seem useful to test whether a supplied E-address is something like reasonable; to make a check, for example, which would reject an attempt to supply a personal name, a postal address, a telephone number, a mere nickname, or nothing at all. For this, it should suffice to check for a match to 'something @ something . something', with little or no concern for the 'something' strings (which may themselves contain dots or spaces).
Consider the effect of OK = /^.+@.+\..+$/.test(S) for example; or the same without ^ or $ or with added \b. Consider also checking that no obviously invalid characters are present - using \S+ rather than .+ . The final field might be put as \w{2,} .
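The effect of the looser and tighter forms can be compared on a few sample inputs. This is only a sketch; the sample strings and the variable names are assumptions.

```javascript
var loose   = /^.+@.+\..+$/;         // something @ something . something
var tighter = /^\S+@\S+\.\w{2,}$/;   // no blanks; final field at least two word characters

var samples = ["fred@example.com", "Fred Bloggs", "01234 567890", "", "a b@example.com"];

// Each entry is [result of loose, result of tighter] for the corresponding sample.
var results = samples.map(function (s) { return [loose.test(s), tighter.test(s)]; });
```

Here only the first sample passes both tests; the last passes the loose test but not the tighter one, because of the embedded space.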
It may also be worth checking that all characters are legal, since that might help a bad typist.
E-mail addresses are used in a case-independent manner; so checking should be case-independent.
An earlier version of this section was in JavaScript General.
IMHO, it is worth validating with /.+@.+\..+/ - something AT something DOT something, or with /(\S+@\S+\.\S+)/ , to ensure that something like an Internet-style E-mail address is present, as opposed to <empty> or some other data.
A good implementation will allow commented E-mail addresses; there are those who use something like familyname@service.invalid , so it can be really useful to insert Fred <familyname@service.invalid> in the field, if it will be honoured for a return message. There may be no very good implementations in use; but the RegExps above should allow it.
As well as 'X@y.z', addresses of the general forms indicated by 'Name Q Name <X@y.z>', '"Name Q. Name" <X@y.z>' and 'X@y.z (Names)' should be acceptable.
If that RegExp is used, or, rather, /<?(\S+@\S+\.\S+)>?/ , then it is the recognised part - RegExp.$1 or whatever is more compatible (Res[1]) - which may be tested for improper characters. More thought is needed if there is to be a check that any comment is RFC-compliant.
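A sketch of extracting the recognised part from a commented address follows. The function name extractAddress is hypothetical, and the pattern uses [^\s<>] in place of \S : that is a substitution made here, since with \S a closing > is swept into the captured part by the final greedy field.

```javascript
// Returns the address-like part of the field, or null if none is present.
function extractAddress(field) {
  var Res = /<?([^\s<>]+@[^\s<>]+\.[^\s<>]+)>?/.exec(field);
  return Res ? Res[1] : null;
}

var a1 = extractAddress("Fred <familyname@service.invalid>");  // "familyname@service.invalid"
var a2 = extractAddress("X@y.z (Names)");                      // "X@y.z"
var a3 = extractAddress("Fred Bloggs, 01234 567890");          // null
```

The extracted string can then be checked for improper characters separately.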
Use the RegExp Tester above to try those and other expressions.
Note that the left part, at least, of a deliverable address can include the characters $ _ | (dollar, underscore, vertical-bar).
In draft. Needs testing.
The general form is something like
Field | RegExp piece
---|---
Quote ? | ("?)
Drive-letter Colon ? | ([a-zA-Z]:)?
Backslash ? | \\?
Directories ? | (\w+(\.\w*)?\\)*
File | \w+(\.\w*)?
Matching quote | \1

Note: I've put \w to represent allowed general characters; that is too restrictive, and needs to be changed. And I've not yet allowed for dot and multi-dot.
For the allowed general characters, there is a choice. To access any existing file, one should exclude only those characters which cannot possibly be present. But for file creation, it is possible instead to allow only pleasing characters; I often allow myself only [a-zA-Z0-9-] for maximum compatibility.
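Assembled into one RegExp, the pieces might look like the sketch below. It is as much a draft as the table: the name pathRE is hypothetical, \w still stands in for the allowed characters, and the directory piece is written with an escaped backslash before its closing parenthesis so that each directory ends in a backslash.

```javascript
// Built from strings so that each piece can carry a comment; backslashes are doubled.
var pathRE = new RegExp(
  '^("?)'                 +  // optional opening quote, remembered as \1
  '([a-zA-Z]:)?'          +  // optional drive letter and colon
  '\\\\?'                 +  // optional leading backslash
  '(\\w+(\\.\\w*)?\\\\)*' +  // any number of directories, each followed by a backslash
  '\\w+(\\.\\w*)?'        +  // file name with optional extension
  '\\1$'                     // a closing quote only if one was opened
);

var plain    = pathRE.test('C:\\dir\\file.txt');     // true
var quoted   = pathRE.test('"C:\\dir\\file.txt"');   // true
var unclosed = pathRE.test('"C:\\dir\\file.txt');    // false : quote not matched
```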
See also :-
This is the usual sort of code :-
f = element
if ( f.value fails ) {
  alert("Error ...")
  f.focus()
  return false }
usually repeated in-line for every field, with each test spelt out in full.
The code for validation of simple, but multiple or non-short, Forms can be greatly condensed in comparison with the usual lengthy approach.
The actions needed in conjunction with validation of each of the fields will be similar, so one can use an Object to define each field to be tested and how it is to be handled, and then process an Array of those for each Form with a general, form-independent, encapsulating function. An editable example Object is given below the following test Form.
If a Form has, or multiple Forms have, fields needing identical validation, a single Object can be used in each case, rather than copying code.
If multiple Forms need identical validation, one could alter the code so that, as well as processing the Form, if any, named in the Object, subsequent parameters to the validation function would identify the Forms.
The scheme handles both textual and non-textual controls.
Error reports can have fixed and variable parts. The exact wording may need careful crafting, dependent on the circumstances, so that the result always reads well enough.
The technique should be adapted according to circumstances.
Almost all textual fields can be usefully validated, at least initially, with a RegExp. For example, /^$|^\d{3}$/ will test for a field either being empty or having three digits.
For an accept-anything RegExp, // is not usable, but /^/ should be. However, one can omit the field from the Object in the array of tests.
For a field to be non-empty, the RegExp is /./ ; one may prefer /^\S+$/ - no blanks and at least one visible character - or /^\S/ - no blanks before at least one visible character.
This test is invoked by the optional element R .
RegExps are not well suited to checking such things as a general numeric range or date field values.
An optional element V names a function to test the form element in more detail. The Age field is thus tested, after being pattern-checked. The function supplied by V can test both textual controls and non-textual controls such as Radiobuttons and Checkboxes. Its first parameter is the Field, and a second parameter given by optional element P can be used as required, as with AgeOK .
Such a function can take account of the values of previously-validated elements; but it should be better to handle interdependencies independently.
New optional parameters could be added for tests of other natures; for example, one reporting "that looks peculiar but is not necessarily wrong - confirm?".
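The scheme described above might be sketched as follows. The page's own code is not reproduced in this text, so this is an assumption-laden outline: elements R, V and P are as described, but N (field name), M (error message), the plain object standing in for a Form, the behaviour given to AgeOK, and the function name validate are all hypothetical.

```javascript
// Stand-in for a Form; in a page this would be such as document.forms["FrmX"].
var form = { Name: { value: "Fred" }, Age: { value: "17" }, Code: { value: "" } };

// Detailed test named by V: first parameter the Field, second the value of P.
function AgeOK(Field, P) { var n = +Field.value; return n >= P[0] && n <= P[1]; }

// One Object per field to be tested.
var tests = [
  { N: "Name", R: /^\S/,        M: "Name must not be blank" },
  { N: "Age",  R: /^\d{1,3}$/,  V: AgeOK, P: [18, 99], M: "Age must be 18 to 99" },
  { N: "Code", R: /^$|^\d{3}$/, M: "Code must be empty or three digits" }
];

// Form-independent checker: the message for the first failure, or "" if all pass.
function validate(form, tests) {
  for (var i = 0; i < tests.length; i++) {
    var T = tests[i], Field = form[T.N];
    if (T.R && !T.R.test(Field.value)) return T.M;   // pattern check, if R is given
    if (T.V && !T.V(Field, T.P))       return T.M;   // detailed check, if V is given
  }
  return "";
}

var report = validate(form, tests);   // "Age must be 18 to 99"
```

In a page, the failure branch would presumably also alert the message and focus the field, as in the usual in-line code.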
The form on the left below is by default correctly filled in; try the test buttons, then try the effect of errors in its fields (here, visibly blank is always an error). The test parameters are defined by functions merely for convenient display; they would normally be supplied to the test functions as variables or literals.
For actual use, take the checkbox decision at design time, remove unnecessary code branches, and redesign error wording style.
The form's action="#" placates W3's Tidy; adjust it at need.
Generalised Form Testing
The Form ("FrmX") :
Editable test-defining Object for FTry4 - similar to PrepObj but different age-range :-
On pressing a button, calling FTry#() provides the results.