© J R Stockton, ≥ 2010-03-07

Check Local Links and Anchors.

This page needs JavaScript & include1.js
and wants styles-a.css.

Links Check

This form spiders a local copy of a Web site tree, starting at the page indicated, and working outwards by following relative links, mainly to locate Link and Anchor errors. Miscellanea are reported. Pages are read into an iframe and parsed by the browser. See also my earlier Pascal/Delphi program CHEKLINX.EXE, via programs index.

Form controls: Directory File; Base of Site; Starting Page (relative to Base of Site); Current (for Directory File content); Readable Extensions; Page loaded by timeout (ms); Shibboleth (g); Twin; WPgN; Self; GoUp; WkDy.
 

Status

The table may change or be out-of-date.

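Browser     | Effects: Misc. | Timeout!=0 | Timeout==0 | Read Self | Go Up   | Usable
MS IE 8     | [1]            | OK         | NO         | YES       | yes [5] | yes
Firefox 3.0 |                | OK         | OK         | NO [2]    | no? [3] | YES
Opera 10.10 |                | OK         | OK         | NO [4]    | YES     | YES
Safari 4.0  |                | OK         | OK         | YES       | YES     | yes [5]
Chrome 3.0  |                | OK         | OK         | YES       | YES     | YES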

I developed this with this page, the Starting Page, and the Directory File all in the same directory. Base of Site now works, but needs tidying up internally.

Notes

This page has been developed mainly in Chrome (which is fast) and Opera (where success came first), using Windows XP sp3.

The pages scanned should be free of HTML and onload script syntax errors. Only relative links are followed, which may rule out sites generated by Microsoft FrontPage.

It is assumed that all folder and file names (but not anchor names) within the site are lower-case on the server, and that they can therefore be made lower-case within this code. Folder and file names MUST NOT be case-dependent.

My earlier program CHEKLINX.EXE reads the files exactly as on disc, with simplified parsing. This page LINXCHEK.HTM scans page structures at the completion of their loading. Scripts executed during loading can, but commonly do not, add and/or remove anchors and/or links.

Apparently, some composing systems put full absolute URLs in links where relative ones would suffice. They will not be recognised by this page. To handle that, take a copy of the site and, using an automated editor, remove the absolute part from the links. For example, an absolute link http://www.igb.at/index.htm in a site based at http://www.igb.at/ is to be replaced by index.htm.

Local links, without explicit protocol, are listed as file protocol.

Algorithm

Names can be stored as string elements of an Array. On the other hand, they can be used to name properties of an Object. That avoids name duplication, and one can determine without ostensible search whether a name is already present. Such Objects are used here to handle page names, anchor names, the names of pages linked to, etc.
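
As a minimal sketch of that idea (the helper name addName and the sample anchor names are hypothetical, not taken from this page's code), names can be recorded as properties of an Object so that duplicates collapse and presence is tested without an explicit search:

```javascript
// Hypothetical helper: record a name as a property of a plain Object.
// Returns true if the name was new, false if it was already present.
function addName(store, name) {
  if (Object.prototype.hasOwnProperty.call(store, name)) {
    return false;
  }
  store[name] = true;
  return true;
}

// An Object with no prototype, so no inherited property names can clash.
const anchors = Object.create(null);
addName(anchors, "top");   // new
addName(anchors, "notes"); // new
addName(anchors, "top");   // already present, nothing added
const anchorNames = Object.keys(anchors); // the two distinct names
```

Creating the store with Object.create(null) is a choice made for this sketch; it prevents names such as "constructor" from appearing to be present already.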

The Directory File (if any) is read first, and its lines are stored using an Object as in the box on the left.

A complete Entry
{Name: string,
 Shib: number,
 Ankas: object,
 Dupes: number,
 Cites: object,
 Next: object}

The named Starting Page is read next. Page data is held in a linked list, and a complete entry is as in the box headed "A complete Entry" above. When a page is read, its anchors and links arrays are attached to Ankas and Cites. New names in Cites are added to the list if they are not folder names, exist on the disc (as determined using the object FromDIR), and have an extension given as acceptable. An object KnownPages is given an entry named for each file added to the list, and is used to determine whether a name is new. While a page is being read, a similar object detects duplicate anchors; other tests are also done and lists made.
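
The queueing of pages might be sketched roughly as follows. The field names follow the Entry box; the functions makeEntry and queuePage, the head/tail holder and the sample page names are assumptions made for illustration, and the tests against FromDIR and the extension list are left out.

```javascript
// Sketch only: makeEntry and queuePage are hypothetical names.
function makeEntry(name) {
  return { Name: name, Shib: 0, Ankas: {}, Dupes: 0, Cites: {}, Next: null };
}

// Append a page to the linked list unless knownPages already has it.
function queuePage(list, knownPages, name) {
  if (Object.prototype.hasOwnProperty.call(knownPages, name)) {
    return false; // already queued or read
  }
  knownPages[name] = true;
  const entry = makeEntry(name);
  if (list.tail) {
    list.tail.Next = entry;
  } else {
    list.head = entry;
  }
  list.tail = entry;
  return true;
}

const pageList = { head: null, tail: null };
const KnownPages = Object.create(null);
queuePage(pageList, KnownPages, "index.htm");
// Pretend index.htm has been read and cites two pages, one of them twice.
["a.htm", "b.htm", "a.htm"].forEach(function (name) {
  queuePage(pageList, KnownPages, name);
});
```

Because new entries go on the tail, a loop that follows Next from the head also reaches pages that were queued while earlier pages were being read.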

After all necessary pages have been read, the entries are scanned to see whether all anchors are present on the appropriate pages, and whether any are duplicated on a page, etc.
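
A minimal sketch of such a final scan is given here. The data shapes are assumptions for the example only: Ankas maps each anchor name to the number of times it was seen, and Cites has one property per cited "page#anchor" string. The real entries may well be organised differently.

```javascript
// Assumed shapes (not taken from the page's code):
//   Ankas: { anchorName: timesSeen }    Cites: { "page.htm#anchor": true }
function checkAnchors(pages) {
  const problems = [];
  Object.keys(pages).forEach(function (pageName) {
    const page = pages[pageName];
    Object.keys(page.Ankas).forEach(function (anchor) {
      if (page.Ankas[anchor] > 1) {
        problems.push(pageName + ": duplicate anchor " + anchor);
      }
    });
    Object.keys(page.Cites).forEach(function (cite) {
      const parts = cite.split("#");
      const target = pages[parts[0]];
      if (parts.length < 2 || !target) return; // no anchor cited, or page not read
      if (!Object.prototype.hasOwnProperty.call(target.Ankas, parts[1])) {
        problems.push(pageName + ": missing anchor " + cite);
      }
    });
  });
  return problems;
}

const sitePages = {
  "index.htm": { Ankas: { top: 1 }, Cites: { "a.htm#notes": true, "a.htm#gone": true } },
  "a.htm": { Ankas: { notes: 2 }, Cites: { "index.htm#top": true } }
};
const anchorProblems = checkAnchors(sitePages);
// anchorProblems holds one missing-anchor report and one duplicate-anchor report
```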

Operation

Loading

The code uses simple browser-testing in order that, at least on my main machine, the controls are initially set suitably.

If the page is invoked with something like linxchek.htm?GoAt=page.htm&Tout=0 (case-dependent query part) then it will immediately run beginning at page.htm, with elements of the Form optionally set by name from the query string. Form input controls currently are, in order, DirF Base GoAt Xtns Tout Shib Smod Self GoUp WkDy Twin WPgN . Use 0/1 or true/false for checkboxes.
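
Reading such a query string could look something like the sketch below. The function parseQuery is hypothetical, and the list of which controls are checkboxes is a guess based on the descriptions on this page; the query names are matched case-dependently, as stated above.

```javascript
// Assumption for this sketch: these controls are the checkboxes.
const CHECKBOXES = ["Self", "GoUp", "WkDy", "Twin", "WPgN"];

// Hypothetical parser: returns an object of control names and values.
function parseQuery(query) {
  const settings = {};
  query.replace(/^\?/, "").split("&").forEach(function (pair) {
    if (pair === "") return;
    const eq = pair.indexOf("=");
    const name = eq < 0 ? pair : pair.slice(0, eq);
    const raw = eq < 0 ? "" : decodeURIComponent(pair.slice(eq + 1));
    if (CHECKBOXES.indexOf(name) >= 0) {
      settings[name] = raw === "1" || raw === "true"; // 0/1 or true/false
    } else {
      settings[name] = raw; // text controls keep the string value
    }
  });
  return settings;
}

const querySettings = parseQuery("?GoAt=page.htm&Tout=0&WkDy=1&Self=false");
// querySettings: { GoAt: "page.htm", Tout: "0", WkDy: true, Self: false }
```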

Otherwise, you must set the controls and press the long button in the usual way. If no such button is revealed, a JavaScript error has already occurred.

Directory File

A Directory File should be named, with the full relative path from the directory of this page. If the field is empty, linked site files will be read on the presumption that they exist. If they do not, the consequences may be browser-dependent.

c:\current\astron-1.htm
c:\current\programs\
c:\current\programs\someprog.pas

Its contents must resemble what, at a Windows XP command prompt, is given for the site by DIR /B /S > $DIR.TXT executed in the base directory of the site (see the example listing above). The characters / and \ are equivalent; directories do not need a trailing slash; case does not matter. If you test on the system that you are using to read this, the c:\current\ part must match the corresponding part of what then appears in a yellow box above after ' this : '.

The program will then read only page files that exist and have listed extensions.
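
The handling of the Directory File might be sketched as below, assuming the listing text has already been obtained. The function names normalise, readDirectoryListing and isReadable are hypothetical, as is passing the extensions as an array.

```javascript
// Make / and \ equivalent, drop any trailing slash, ignore case.
function normalise(path) {
  return path.replace(/\\/g, "/").replace(/\/+$/, "").toLowerCase();
}

// Build an Object whose property names are the normalised listing lines.
function readDirectoryListing(text) {
  const fromDir = Object.create(null);
  text.split(/\r?\n/).forEach(function (line) {
    const name = normalise(line.trim());
    if (name !== "") fromDir[name] = true;
  });
  return fromDir;
}

// A file is read only if it is listed and its extension is acceptable.
function isReadable(fromDir, path, extensions) {
  const name = normalise(path);
  const dot = name.lastIndexOf(".");
  const ext = dot < 0 ? "" : name.slice(dot + 1);
  return name in fromDir && extensions.indexOf(ext) >= 0;
}

const listingText = "c:\\current\\astron-1.htm\nc:\\current\\programs\\\nc:\\current\\programs\\someprog.pas";
const dirNames = readDirectoryListing(listingText);
isReadable(dirNames, "C:/Current/Astron-1.htm", ["htm", "html"]);          // listed, extension accepted
isReadable(dirNames, "c:/current/programs/someprog.pas", ["htm", "html"]); // listed, extension not accepted
```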

The decision as to whether a file is deemed Present or Missing depends entirely on the Directory File that the user provides. Peculiar cases may remain to be handled. For example, a link to a file name containing ^ was earlier considered to be a link to a file with %5E in that position.

The Directory File is read using the iframe. Its contents are read with textContent or innerText.

Base of Site

The Base of Site is the full relative path from the directory of this page to the root of the local copy of the site to be tested - typically, the relative location of its index.htm page. If not empty, it should end with "/".

Control GoUp needs to be checked if Base of Site is not empty.

Starting Page

The Starting Page is given with the full relative path from the Base of Site - typically, just as index.htm. It should be, and may need to be, the case that all pages of the test site are in that directory or its subdirectories.

Dates

When the WkDy box is checked, the pages are scanned for anything like an ISO 8601 date (yyyy-mm-dd) followed by whitespace then exactly three characters each of which is a letter or a question mark (XXX), followed by a blank. Whenever that is found, the date part is checked for validity and the day-of-week of the date is compared with XXX. Certain common XXXs, such as "the" and "and", are disregarded. The code to do this was taken from Day-of-Week Checking in JavaScript Miscellany 1, with subsequent changes.
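
The check described above can be approximated by the following sketch, which is not the code taken from JavaScript Miscellany 1. The function checkDates is hypothetical; the list of disregarded words contains only the two named above; comparing case-insensitively and treating question marks literally (so that "???" is reported) are assumptions.

```javascript
const DAY_NAMES = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
const DISREGARDED = ["the", "and"]; // only the two examples named in the text

function checkDates(text) {
  // yyyy-mm-dd, whitespace, exactly three letters or question marks, a blank
  const re = /(\d{4})-(\d\d)-(\d\d)\s+([A-Za-z?]{3}) /g;
  const reports = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    const y = +m[1], mo = +m[2], d = +m[3], tla = m[4];
    const iso = m[1] + "-" + m[2] + "-" + m[3];
    if (DISREGARDED.indexOf(tla.toLowerCase()) >= 0) continue;
    const date = new Date(0);
    date.setUTCFullYear(y, mo - 1, d); // avoids the two-digit-year rule of Date.UTC
    const valid = date.getUTCFullYear() === y &&
                  date.getUTCMonth() === mo - 1 &&
                  date.getUTCDate() === d;
    if (!valid) {
      reports.push(iso + " is not a valid date");
    } else if (DAY_NAMES[date.getUTCDay()].toLowerCase() !== tla.toLowerCase()) {
      reports.push(iso + " is " + DAY_NAMES[date.getUTCDay()] + ", not " + tla);
    }
  }
  return reports;
}

const dateReports = checkDates(
  "Updated 2010-03-07 Sun by me; see 2010-03-08 Sun also, 2010-02-30 Tue no, 2010-03-09 and more."
);
// dateReports: ["2010-03-08 is Mon, not Sun", "2010-02-30 is not a valid date"]
```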

Years above 275350 are not checked. Negative years may be a problem?

The best ways of dealing with false positives include
  • If the date is intentionally invalid, use another separator;
  • Rephrase so that the date, as displayed, is not followed by a TLA.

This option slows the program considerably, in some browsers.

Starting

There may be no test for the Starting Page being missing; but that will rapidly become obvious.

Misc

A browser may not let a page load a copy of itself into its own iframe. If Self coerces to false, this page will not be queued for reading. To check whether the browser allows it, try this page as the Starting Page.

A browser may not be able to handle pages in subdirectories here. If GoUp coerces to false, subdirectory pages will not be queued for reading.

If a file or directory name could be interpreted by JavaScript as a number, then there could possibly be a clash with another name that corresponded to the same number.

Some browsers MAY get confused about which directory a page is in.

With some browsers, e.g. Opera 10.01, it may be necessary that the pages scanned are free of "major" errors.

If Twin is checked, a quasi-alert is given if, within a page, two consecutive links display the same text. That might be improved upon.
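
A minimal sketch of the Twin test, assuming the displayed texts of a page's links are already available as an array in document order (the function findTwins is a hypothetical name):

```javascript
// Report each link text that is identical to the text of the link before it.
function findTwins(linkTexts) {
  const twins = [];
  for (let i = 1; i < linkTexts.length; i++) {
    if (linkTexts[i] === linkTexts[i - 1]) {
      twins.push(linkTexts[i]);
    }
  }
  return twins;
}

const twinTexts = findTwins(["Home", "Notes", "Notes", "Index", "Home"]);
// twinTexts: ["Notes"]; the two "Home" links are not consecutive
```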

If WPgN is checked, then in All_LinkTexts the name of the current page precedes the text of the link.

Running

The code may seem slow to start; maybe a browser needs to get resources.

The code may tend to appear to run in fits and starts.

The browser's error-display system will indicate, for the pages scanned, any recognised errors in their HTML and in any of their scripts that run during or on page-load. Dismissing such an error should allow this page to proceed.

The Clear button will abort running and enable the big button.

After Scanning

A run is completed and analysed when the status line of the form goes green, the iframe vanishes, and more buttons appear instead, under the Status line within the blue Form. Press them in any sequence.

Note that, when the SORT button is used, all numbers and any surrounding spaces are considered equivalent (to " # "). A number is a digit string bounded by blanks.
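
One way to get that effect is to sort on a key in which each blank-bounded digit string has been replaced, as in this sketch. The functions sortKey and sortLines are hypothetical, and how the page itself orders lines with equal keys is not stated; here they keep their original order.

```javascript
// Replace every digit string that has blanks on both sides, together with
// the blanks around it, by " # ".
function sortKey(line) {
  return line.replace(/ +\d+(?= )/g, " #").replace(/# +/g, "# ");
}

// Sort a copy of the lines by key; the original array is left untouched.
function sortLines(lines) {
  return lines.slice().sort(function (a, b) {
    const ka = sortKey(a), kb = sortKey(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}

sortKey("page 12 of 30 items");                               // "page # of # items"
const sortedRows = sortLines(["row 10 b", "row 9 a", "row 2 c"]);
// sortedRows: ["row 9 a", "row 10 b", "row 2 c"], ordered by the letters only
```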

Notes to Me

More into Consolidation? Check drive matches $dir.txt. Remove Current from FromDIR indexes?? Test a single file??

Glossary

LocalOutLinks
Links to files on the current machine but not in the directory of the starting file or subdirectories thereof. Not common in Web sites, but useful to me, e.g. +.
Page loading timeout
An entry of zero or blank means that code to automatically detect the end of each frame loading is to be used, which does not work in all browsers. Otherwise, a page will be read with that delay after calling for loading. If the delay is too short, later links and anchors, etc., will not be found. Try 300 ms.
Readable Extensions
Extensions of files which can be safely loaded into the iframe. Files with other extensions are not read.
Current
The directory containing your copy of this page.
Shibboleth
An argument for a new RegExp(), which is counted in the body.textContent or body.innerText of each page read. Do not double backslashes. The results are given in Consolidation and EntitySummary.
Unlike8point3
I still use utilities from DOS days, and so check for restricted DOS-type file names.
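
To illustrate the Shibboleth entry above, the count for one page might be obtained as in this sketch. The function countShibboleth is hypothetical, and it is an assumption that the g shown beside the Shibboleth control is the RegExp flag.

```javascript
// Count the matches of a pattern in the text of a page.
// With the "g" flag, String.prototype.match returns every match.
function countShibboleth(text, pattern, flags) {
  const matches = text.match(new RegExp(pattern, flags || "g"));
  return matches ? matches.length : 0;
}

// In the form one would type \d+ ; the backslash is doubled here only
// because this is a JavaScript string literal.
const shibCount = countShibboleth("a 1 bb 22 ccc 333", "\\d+", "g"); // 3
```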
© Dr J R Stockton, near London, UK.
All Rights Reserved.
These pages are tested mainly with Firefox 3.0 and W3's Tidy.
This site, http://www.merlyn.demon.co.uk/, is maintained by me.