Merlyn - Pascal Floating-Point

The term "Borland's Pascal" here includes "Turbo Pascal" and "Borland Pascal" and often "Borland Delphi".

The following sections are written for Borland's Pascal. In other Pascals, the type real is generally the best to use. For portability, consider declaring variables as float, so that a single dialect-dependent declaration, type float = ... ;, selects the underlying type.

Representation of Numbers

In the Outside World, numbers are represented in decimal notation. Within a PC, numbers are usually (but not invariably) represented in binary notation. Exact administrative arithmetic would be easier in a decimal-based computer. However, there is no difficulty in representation and conversion where all numbers can be represented as integers; nor in scientific / technical / measurement work, where a little uncertainty in the results is fully acceptable.

There are two basic types of numbers (disregarding complex numbers, quaternions, ...): integers, and non-integers held in fixed-point or floating-point form.

As "real" has a very specific meaning in Borland's Pascals (and another in higher-numbered Delphis), I will here use "float" as the generic term for fixed-point and floating-point.

Number representations vary across computer types and languages; specific ones are used in PC Pascal. Float quantities can only be represented to finite accuracy in a computer; few decimal floats can be represented exactly in binary, and vice versa. In decimal text, "E" (or "e") is taken as meaning "times 10 to the power of". Conversion between binary float and decimal float is non-trivial; it is a pity that early man began to count on his upper digits, rather than on only his fingers.

In Pascal and Delphi, as in many other languages, numbers are represented in binary notation (except where explicitly programmed otherwise, in detail); and there are no true fixed-point types other than Delphi's currency.

The Floating Point Unit (FPU) of a PC has, in addition to the three IEEE-standard true floating-point types single, double, and extended, a type without an exponent, comp, which takes only integer values. The comp type is used, in Delphi, as the basis for the currency type, which works in units interpreted as ten-thousandths.
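
The principle behind currency can be sketched briefly. The following is an illustration in Python rather than Pascal, used only because its integers are exact and the result is easy to check; the names SCALE, to_units and to_text are invented for this sketch and are not Delphi's. Amounts are held as whole numbers of ten-thousandths, so addition and subtraction are exact.

```python
from decimal import Decimal

SCALE = 10_000  # assumed: one unit is a ten-thousandth, as for Delphi's currency

def to_units(text):
    """Convert decimal text such as '12.3456' to an integer count of ten-thousandths."""
    units = Decimal(text) * SCALE
    if units != units.to_integral_value():
        raise ValueError("more than four decimal places: " + text)
    return int(units)

def to_text(units):
    """Format an integer count of ten-thousandths as decimal text with four places."""
    sign = "-" if units < 0 else ""
    whole, frac = divmod(abs(units), SCALE)
    return f"{sign}{whole}.{frac:04d}"

total = to_units("0.10") + to_units("0.20")
print(to_text(total))              # the sum is held as the integer 3000
print(total == to_units("0.30"))   # integer comparison, so exact
```

Only conversion to and from text involves the decimal point; everything in between is integer arithmetic.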

Floating Decimal Numbers

The trouble with floating decimals is that we have no hardware or language support for them, though routines to emulate them using a record type with separate mantissa and exponent should be easy enough to write, if slow. Try a Google search for "John Herbster" and/or "JohnH".
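
A minimal sketch of such a record, in Python rather than Pascal so that it can be checked easily. The names DecFloat, add and mul are hypothetical, and the sketch assumes unbounded integers; a Pascal version would have to round the mantissa to a fixed number of digits, which is not shown here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecFloat:
    """A decimal floating-point value: mantissa * 10**exponent, both integers."""
    mantissa: int
    exponent: int

    def normalised(self):
        """Strip trailing zeros from the mantissa so equal values compare equal."""
        m, e = self.mantissa, self.exponent
        if m == 0:
            return DecFloat(0, 0)
        while m % 10 == 0:
            m //= 10
            e += 1
        return DecFloat(m, e)

def add(a, b):
    """Exact sum: align both operands to the smaller exponent, then add mantissas."""
    e = min(a.exponent, b.exponent)
    m = a.mantissa * 10 ** (a.exponent - e) + b.mantissa * 10 ** (b.exponent - e)
    return DecFloat(m, e).normalised()

def mul(a, b):
    """Exact product: multiply mantissas, add exponents."""
    return DecFloat(a.mantissa * b.mantissa, a.exponent + b.exponent).normalised()

# 0.1 + 0.2 is exactly 0.3 in this representation, unlike in binary floating point.
print(add(DecFloat(1, -1), DecFloat(2, -1)) == DecFloat(3, -1))
```

Python's standard decimal module is a full implementation of the same idea, with rounding to a chosen precision.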

General Floating-Point

Floating-point (real) arithmetic is necessarily inexact (see Guido Gybels); even simple-seeming numbers like 1/3 or 0.2 cannot be represented exactly in binary. That is why the "printed" value of a variable, or the output of a simple calculation, may show an inexact value - something like 1.199999997 or 1.200000003 where 1.2 or 1.20 is expected.
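
The effect is easy to see for oneself. This sketch is in Python, whose float is an IEEE double; the helper as_single is an invention of this example, rounding to IEEE single by way of the standard struct module. It should correspond to the double and single types of Borland's Pascal, though that has not been checked against a Pascal compiler here.

```python
import struct
from fractions import Fraction

def as_single(x):
    """Round a double to the nearest IEEE single and return it widened back to double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# The double nearest to 0.2 is not exactly 1/5.
print(Fraction(0.2) == Fraction(1, 5))   # False
# Printed to ten significant figures, 1.2 held as a single shows its error.
print(f"{as_single(1.2):.10g}")          # 1.200000048
# A value that is a sum of a few powers of two is held exactly.
print(as_single(0.5) == 0.5)             # True
```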

Accuracy and resolution must not be confused. In the case of single and double, the accuracy should approach the resolution; but for the highest-resolution case, extended, the accuracy, while better than for double, will be appreciably less than the resolution. Usually.

Borland's Pascal and Delphi use IEEE single, double, and extended types, and also the archaic deprecated 6-byte real/real48 types. For conversions by arithmetic to a numerical value, which could be translated to another language, see Pascal / Delphi / + Types.

Ordinary JavaScript type Number uses IEEE Doubles, though other formats may be imminent (mid-2001); see JavaScript Maths and via JavaScript Rounding 0.

Note that, while A+B = B+A, it may be that A+B+C <> C+B+A, since that is a comparison of (A+B)+C & (C+B)+A which may round differently. In JavaScript IEEE Doubles, X = [0.03+0.03+0.01, 0.01+0.03+0.03] gives [0.06999999999999999, 0.07].
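
The same can be tried in any language using IEEE Doubles. A small sketch in Python, whose float is an IEEE double; the function name left_to_right is made up for the illustration.

```python
def left_to_right(a, b, c):
    """Return (a + b) + c, rounding to double after each addition."""
    return (a + b) + c

# Addition of two values commutes exactly.
print(0.1 + 0.2 == 0.2 + 0.1)                   # True

# The same three values, summed in opposite orders.
forward = left_to_right(0.1, 0.2, 0.3)
reverse = left_to_right(0.3, 0.2, 0.1)
print(forward, reverse, forward == reverse)     # 0.6000000000000001 0.6 False

# The values quoted in the text, left for the reader to run.
print(left_to_right(0.03, 0.03, 0.01), left_to_right(0.01, 0.03, 0.03))
```

In Pascal the test needs variables rather than literal constants, and on an FPU the intermediate sum may be held to extended precision, so the outcome there can differ from the above.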

References

IEEE Standard 754 describes Floating Point Numbers, giving formats for normalised, denormalised, zero, NaN, Inf, indeterminate. It defines the 4-byte Single and 8-byte Double formats and permits extended formats, of which the FPU's 10-byte Extended type is one; the formats are as in Pascal / Delphi / + Types. For links about IEEE floating-point, see EFG's Item B-5.

Seek also the edited reprint (large: 500K + graphics) of the paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic", by David Goldberg (1991); while parts are esoteric, much is easy to read.

Read EFD, "Numerical Accuracy 101 for Delphi Developers"; its principles apply widely (gone, Aug 2005?) (now at archive.org).

For much information on floating point, read BPL70N16.ZIP by Norbert Juffa {norbert@iit.com}, which includes a maths library. It is presumably still on Garbo, maybe updated. He also wrote "Everything You Always Wanted To Know About Math Coprocessors". The latest I have at hand is 'copro16a.txt' (dated 01-Oct-94). This document appears to be available (2007) from SimTel, Garbo and KiArchive.

Integer, Fixed, or Float?

If in essence dealing with countable items, use integer-like types whenever practicable. Delphi Currency is substantially integer, using centipenny units.

After that, in the rare cases where the work is dominantly adding or subtracting, use fixed point, if available.

In normal technical work, and wherever the work includes significant multiplication and division of imperfectly-knowable quantities, use floating point.

With the 10-byte Extended type available, few practical calculations will have inadequate resolution. Where resolution is lacking, consider working not with "value" but with "value minus nominal" or similar.
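
A sketch of why "value minus nominal" helps, using Python's IEEE double since Python has no extended type; the principle is the same at any width. NOMINAL and the other names are assumptions of the example, with the nominal value chosen so large that the spacing between adjacent doubles there is 2.

```python
NOMINAL = 1.0e16   # assumed nominal value; adjacent doubles here are 2 apart

def absolute(offset):
    """Value stored directly: nominal plus offset, rounded to the nearest double."""
    return NOMINAL + offset

small_change = 1.0
# Stored as an absolute value, a change of 1 is below the resolution and vanishes.
print(absolute(small_change) == NOMINAL)    # True
# Stored as "value minus nominal", the same change is kept exactly.
offsets = [0.0, small_change]
print(offsets[1] - offsets[0])              # 1.0
```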

When doing currency conversions, check for actual rules. Conversions of actual money between Euroland currencies (the national currencies of countries which are/were about to adopt the euro as their currency, and have fixed exchange rates) must follow strict rules; a strict rule says so.

One can see that Delphi's type TDateTime = double has dangers (it also has ridiculous properties before 1899-12-30). Days, though actually measured and slightly uncertain, are considered countable. For administration, one should use integer or longint or suchlike Days; for most science, floating-point seconds. In programs/, mjd_date.pas and dateprox.pas (etc.) have date code.

Comparison of Floating-Point Values

Floating-point values cannot reliably be compared for equality, except sometimes with zero or when obtained by copying. Where exactness is required, either use integral types only, perhaps by scaling, or if appropriate convert by Round, Trunc, etc. to integers.

If the arithmetic required is exact in pence, then do the calculation in pence not in pounds; that should be exact, even with float variables, up to the limit for the type.
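
To illustrate, here is a sketch in Python, whose float is an IEEE double; the variable names and the helper to_pence are made up for the example. Ten items at 10p are summed first in pounds and then in pence, both in float variables.

```python
# Ten items at 10p each, summed as pounds held in binary floats.
pounds_total = 0.0
for _ in range(10):
    pounds_total += 0.10
print(pounds_total == 1.0)      # False: 0.10 is not exact in binary

# The same sum in pence; whole numbers are exact in a double up to 2**53.
pence_total = 0.0
for _ in range(10):
    pence_total += 10.0
print(pence_total == 100.0)     # True

def to_pence(pounds):
    """Convert a pounds value to whole pence once, at the boundary, by rounding."""
    return round(pounds * 100)

print(to_pence(1.10))           # 110
```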

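Comparison functions of two types, absolute and relative, can be written, to test whether two values differ by no more than a fixed tolerance, or by no more than a tolerance proportional to the size of the values.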
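
A minimal sketch of one function of each kind, in Python for ease of checking; the names close_absolute and close_relative and the tolerances used are illustrative choices, not recommendations.

```python
def close_absolute(a, b, tolerance):
    """True if a and b differ by no more than a fixed amount."""
    return abs(a - b) <= tolerance

def close_relative(a, b, fraction):
    """True if a and b differ by no more than a fraction of the larger magnitude."""
    return abs(a - b) <= fraction * max(abs(a), abs(b))

x = 0.1 + 0.2
print(x == 0.3)                         # False
print(close_absolute(x, 0.3, 1e-12))    # True
print(close_relative(x, 0.3, 1e-12))    # True

# A fixed tolerance suited to values near 1 is far too tight for large values.
big = 1.0e20
nearly_big = big * (1 + 1e-15)
print(close_absolute(big, nearly_big, 1e-12))   # False
print(close_relative(big, nearly_big, 1e-12))   # True
```

The relative test is of no use when comparing with zero, for which an absolute tolerance is needed. Python's standard math.isclose combines both tests.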

Borland's Pascal Floating-Point Types

The archaic six-byte real type of Borland's Pascal is not native to the hardware, and so is slow. Borland's types single, double and extended are, in all modern PCs, supported directly in the FPU hardware.

Originally, Borland used only the six-byte type, real, implemented in software. Since the arrival of the IEEE floating-point standard, and of coprocessors and built-in FPUs using it, the FPU types (single, double, extended, comp) are almost always greatly preferable. TP5 and up provide an emulator allowing use of these types on machines without hardware support. Indeed, Delphi 4 and later redefine real (as meaning double), and provide the six-byte real48 type for backwards compatibility.

The sub-section "Float Formats", containing a table, which was here, is now in Pascal / Delphi / + Types. See there also for the sizes of all simple types in Pascal (TP, BP, FPC, TMT) and Delphi versions.

Float Usage

Something not in the comp.lang.pascal.borland newsgroup mini-FAQ (ZIP), nor in Prof. Timo Salmi's Pascal FAQ until Item #133 appeared, and which I suspect may be insufficiently considered by the less fully experienced :-

What are the trade-offs in using the various Float types - single, real, double, extended, comp - and the dependence on available hardware - FPU, possible FPU, no FPU? I have my own ideas, but some might be disputable.

I suspect that many users just tend to use the 6-byte real type regardless, even though in Borland's Pascal it is generally much slower than using an IEEE-754 type.

And there's otherwise no harm in using extended everywhere else, space permitting. But this may be over-simplistic.

It is probably unwise to use the standard float type identifiers frequently in a program; one should use one's own type names for the different sorts of variables one is using, and these should be initially defined within a "COMMON.PAS" unit in terms of single, real / float, double, and extended. One can thus at a later time redefine one whole set of variables to be of a different type, should this prove desirable.

It is possible, though unwise, to have type real = extended ; it is better to use type float = extended ;.

I wonder whether selection of floating types is in Glenn Grotzinger's Turbo Pascal Tutor?

Literal Constants

Where floats are written in code with no type information, they will (if memory serves) be stored as for the extended type. Comparison for equality with non-extended variables will not give useful results, unless the values happen to be such as are stored exactly.
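
The effect can be imitated one step down in width. This Python sketch treats a double literal as the wider constant and an IEEE single as the narrower variable; the helper as_single is an invention of this example, and the parallel with extended constants in Pascal is an assumption, not something tested with a Pascal compiler.

```python
import struct

def as_single(x):
    """Round a double to the nearest IEEE single, returned widened back to double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

stored = as_single(0.1)             # a variable of the narrower type
print(stored == 0.1)                # False: the wider literal carries more bits of 0.1
print(as_single(0.5) == 0.5)        # True: 0.5 is exact at both widths
# Bringing the literal to the variable's width first makes the comparison work.
print(stored == as_single(0.1))     # True
```

In Pascal the equivalent of the last line is to assign the constant to a variable of the same type before comparing, or to use a typed constant.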

Quotes

Date: Wed, 12 Jun 1996 18:28:13 -0500
From: Will <libbus@ix.netcom.com>
To: Dr John Stockton
Subject: Re: Q not FA ? - which float type?

The REAL type is an artificial Borland construct; it is a compromise between accuracy (better than single, worse than double) and speed. It is not supported by any FPU, Intel or otherwise.

The SINGLE, DOUBLE and EXTENDED types are IEEE constructs which are supported by all FPUs natively. The main difference between them as regards programming is size: SINGLE is 32 bits (2 words), DOUBLE is 64 bits (4 words) and EXTENDED is 80 bits (5 words). If you are using a REAL FPU (as opposed to the software emulator), then ALL IEEE data types are internally converted to EXTENDED before being used. REAL types ALWAYS use the internal software math package because they are not supported by the FPU.

If accuracy is not a major factor, but speed is and your program will possibly run a machine which does not have an FPU, I would use the REAL type. If accuracy is a factor, regardless of whether there is an FPU or not, I would use one of the IEEE types, depending upon space considerations.

NOTE: I'm not sure (and I'm not about to dig out my TP manuals :) whether the software emulator promotes SINGLEs and DOUBLEs to EXTENDED before working on them; I'm pretty sure it does, which would make the only reason for using SINGLE or DOUBLE space considerations.

In an article of Sun, 23 Jun 1996 22:50:40 in comp.lang.pascal.borland, JWillard44 <jwillard44@aol.com> wrote :-

REAL is a six byte per variable type that was invented by Borland (or more likely, the people that Borland bought TP from) because it results in reasonable accuracy but can be manipulated quickly without a NDP. It is a good compromise, but it is a compromise. It was developed primarily for the limited resources of the CP/M world.

SINGLE, DOUBLE, EXTENDED are different IEEE floating point types of 4, 8, or 10 bytes respectively. The only difference among the three types is how they are stored. The single and double types are loaded into the NDP or the emulator where they are expanded to extended internally. The longer length of time to read or write an extended type is somewhat offset by the time to convert to/from the extended. Calculation times are therefore the same, no matter the type. The only advantage of the smaller types is the reduced storage space. Usually this will be important only in the case of large arrays or large files.

COMP is somewhat of a mongrel. It too, is apparently converted to extended for calculations, but is truncated at the end of each operation. (This could be checked by experimentation or by reading the documentation on the 803x7.) However the storage is as a 64 bit (63 plus sign) 1's compliment binary integer. TP treats input and output of COMP numbers as if it were an extended number.

Floating-Point in TP/BP Hardware ISRs

In a Hardware Interrupt Service Routine, it seems at best very difficult to use the same sort of floating-point routines as in the main program. If using the FPU throughout, it is necessary to save and restore the extended registers EAX..EDX; it is necessary to save and restore the FPU state; but these are found to be not sufficient.

AIUI, the problem is that, in doing certain FP operations, the FPU is assisted by a small amount of "helper" x86 code, and this code is not re-entrant. The helper, in acting for a FPU operation in the ISR, can corrupt an action which it is in the middle of performing for the interrupted code. See also Pascal/Delphi Rounding and Truncation.

The solution which worked for me was to use in the main program only the IEEE types (single, double, extended), and in the ISR only the old 6-byte type (real). Transfer of values between IEEE & 6-byte types was done in the main program; it was protected by disabling interrupts during transfer, lest the source value changed during the non-atomic transfer. The ISR was in a separate unit, compiled with {$N-,E-}, and all possible checks off. (* Effect of set87=y/n ? *)

The code worked, apparently properly, with BP7.01 in DOS & DPMI modes, either without Windows or in a WfWg3.11 DOS box (though the interrupt rate limit was much lower there).

Related Information

I understand that Trunc, Int & Frac can cause problems, because they alter the FPU control word, saving the old value unsafely.

Output as Text

I have code to output a non-negative longint as English text, perhaps to be extended to include decimal places, in cashtext.pas (under-tested).

These pages are tested mainly with MS IE 7 and Firefox 3.0 and W3's Tidy.
This site, http://www.merlyn.demon.co.uk/, is maintained by me.