Roundoff Errors
As observed in Chapter 1, various errors may arise in the process of calculating an approximate
solution for a mathematical model. In this chapter we introduce and discuss in detail the most
fundamental source of imperfection in numerical computing: roundoff errors. Such errors arise due
to the intrinsic limitation of the finite precision representation of numbers (except for a restricted set
of integers) in computers.
Different audiences may well require different levels of depth and detail in the present topic.
We therefore start our discussion in Section 2.1 with the bare bones: a collection of essential facts
related to floating point systems and roundoff errors that may be particularly useful for those wishing
to concentrate on the last seven chapters of this text.
Note: If you do not require detailed knowledge of roundoff errors and their propagation
during a computation, not even the essentials of Section 2.1, then you may skip this chap-
ter (not recommended), at least upon first reading. What you must accept, then, is the
notion that each number representation and each elementary operation (such as + or ∗) in
standard floating point arithmetic introduces a small, random relative error: up to about
10^−16 in today’s standard floating point systems.
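To see such a representation error concretely, here is a small Python sketch (Python is used here purely for illustration). It measures the relative error committed when the decimal number 0.1 is stored as a double, using exact rational arithmetic as the reference:

```python
from fractions import Fraction

# 0.1 has no finite binary expansion, so storing it as a double
# introduces a small relative representation error.
x = 0.1
exact = Fraction(1, 10)                  # the intended real number
stored = Fraction(x)                     # the double actually stored
rel_err = abs(stored - exact) / exact    # computed in exact rational arithmetic
print(float(rel_err))                    # roughly 5.5e-17, below 10^-16
```

The error is nonzero but tiny, consistent with the bound quoted above.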
In Section 2.2 we get technical and dive into the gory details of floating point systems and
floating point arithmetic. Several issues only mentioned in Section 2.1 are explained here.
The small representation errors as well as errors that arise upon carrying out elementary op-
erations such as addition and multiplication are typically harmless unless they accumulate or get
magnified in an unfortunate way during the course of a possibly long calculation. We discuss round-
off error accumulation as well as ways to avoid or reduce such effects in Section 2.3.
Finally, in Section 2.4, the IEEE standard for floating point arithmetic, which is implemented
in any nonspecialized hardware, is briefly described.
2.1 The essentials
This section summarizes what we believe all our students should know as a minimum about floating
point arithmetic. Let us start with a motivating example.
Example 2.1. Scientists and engineers often wish to believe that the numerical results of a computer
calculation, especially those obtained as output of a software package, contain no error—at least not
a significant or intolerable one. But careless numerical computing does occasionally lead to trouble.
Note: The word “essential” is not synonymous with “easy.” If you find some part of the
description below too terse for comfort, then please refer to the relevant section in this
chapter for more motivation, detail, and explanation.
One of the more spectacular disasters was the Patriot missile failure in Dhahran, Saudi Arabia, on
February 25, 1991, which resulted in 28 deaths. This failure was ultimately traced to poor han-
dling of roundoff errors in the missile’s software.
Computer memory has a finite capacity. This obvious fact has far-reaching implications for the
representation of real numbers, which in general do not have a finite uniform representation. How
should we then represent a real number on the computer in a way that can be sensibly implemented
in hardware?
Any real number x ∈ R is accurately representable by an infinite sequence of digits; in binary we
can write

    x = ±(1.d_1 d_2 d_3 d_4 · · ·)_2 × 2^e,

where e is an integer exponent and each binary digit d_i is either 0 or 1. The (possibly infinite)
set of binary digits {d_i} must be cut short to fit in a fixed-size memory word: only the first t
digits past the binary point are kept, in a manner that will soon be
specified. Storing this fraction in memory thus requires t bits. The exponent e must also be stored in
a fixed number of bits and therefore must be bounded, say, between a lower bound L and an upper
bound U. Further details are given in Section 2.2, which we hope will provide you with much of
what you’d like to know.
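The sign, exponent, and fraction fields described above can be inspected directly. The following Python sketch (assuming the IEEE 754 double layout of Section 2.4: 1 sign bit, 11 exponent bits with a bias of 1023, and t = 52 fraction bits) unpacks a double into its three fields:

```python
import struct

def fields(x):
    """Unpack the sign bit, exponent, and 52 fraction bits of a double
    (assumes the standard IEEE 754 64-bit layout)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)     # the digits d_1 d_2 ... d_52
    return sign, exponent - 1023, fraction

print(fields(1.0))    # (0, 0, 0):  1.0 = +(1.00...0)_2 * 2^0
print(fields(-6.0))   # (1, 2, 2251799813685248): -6.0 = -(1.1)_2 * 2^2
```

Note that the leading digit 1 before the binary point is not stored at all; it is implied, a point revisited in Section 2.2.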
Rounding unit and standard floating point system
Without any further details it should already be clear that representing x by fl(x) necessarily causes
an error. A central question is how accurate the floating point representation of the real numbers is.
⁴It is known from calculus that the set of all rational numbers in a given real interval is dense in that interval. This
means that any number in the interval, rational or not, can be approached to arbitrary accuracy by a sequence of rational
numbers.
Downloaded 01/27/18 to 132.174.254.159. Redistribution subject to SIAM license or copyright
In today’s standard floating point system a word occupies 64 bits, with t = 52 binary digits stored
for the fraction; the resulting rounding unit, which bounds the relative error in representing any
real number within range, is η = 2^−53 ≈ 1.1 × 10^−16. This is called double precision; see the
schematics in Figure 2.1.
Figure 2.1. A double word (64 bits) in the standard floating point system. The blue bit is
for sign, the magenta bits store the exponent, and the green bits are for the fraction.
If the latter name makes you feel a bit like starting house construction from the second floor,
then rest assured that there is also a single precision word, occupying 32 bits. This one obviously
has a much smaller number of digits t, hence a significantly larger rounding unit η. We will not
use single precision for calculations anywhere in this book except for Examples 2.2 and 14.6, and
neither should you.
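The gap between the two word sizes is easy to exhibit. The Python sketch below (stdlib only; the round-trip through a 32-bit encoding is our stand-in for storing a value in single precision) compares the rounding units:

```python
import struct
import sys

eps_double = sys.float_info.epsilon      # 2^-52, about 2.2e-16 (twice eta)
print(eps_double)

def to_single(x):
    """Round a double to the nearest single precision value by packing it
    into a 32-bit IEEE word, then widen it back to a double."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

x = 0.1
rel = abs(to_single(x) - x) / x
print(rel)   # near 1.5e-8: single precision keeps only t = 23 fraction bits
```

The relative error jumps by about eight orders of magnitude, which is exactly why single precision is avoided here for serious computation.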
Roundoff error accumulation
Even if number representations were exact in our floating point system, arithmetic operations in-
volving such numbers introduce roundoff errors. These errors can be quite large in the relative sense,
unless guard digits are used. These are extra digits that are used in interim calculations. The IEEE
standard requires exact rounding, which guarantees that the relative error in each arithmetic operation
is also bounded by η.
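We can verify this bound empirically. The sketch below (in Python, for illustration) uses exact rational arithmetic as the reference: since the operands are themselves already-rounded doubles, exact rounding means each operation must return the correctly rounded result of the stored operands, with relative error at most η = 2^−53:

```python
from fractions import Fraction

eta = 2.0 ** -53   # rounding unit of double precision

def rel_error(computed, exact):
    """Relative error of a computed double versus the exact rational value."""
    return abs(Fraction(computed) - exact) / abs(exact)

a, b = 0.1, 0.3
# Exact sum and product of the *stored* operands, not of 1/10 and 3/10.
exact_sum = Fraction(a) + Fraction(b)
exact_prod = Fraction(a) * Fraction(b)
assert rel_error(a + b, exact_sum) <= eta
assert rel_error(a * b, exact_prod) <= eta
print("each operation committed a relative error of at most eta")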
Given the above soothing words about errors remaining small after representing a number and
performing an arithmetic operation, can we really put our minds at ease and count on a long and
intense calculation to be as accurate as we want it to be?
Not quite! We have already seen in Example 1.3 that unpleasant surprises may arise. Let us
mention a few potential traps. The fuller version is in Section 2.3.
Careless computations can sometimes result in division by 0 or another form of undefined
numerical result. The corresponding variable name is then assigned the infamous designation NaN.
This is a combination of letters that stands for “not a number,” which naturally one dreads to see in
one’s own calculations, but it allows software to detect problematic situations, such as an attempt
to divide 0 by 0, and do something graceful instead of just halting. We have a hunch that you will
inadvertently encounter a few NaN’s before you finish implementing all the algorithms in this book.
There is also a potential for an overflow, which occurs when a number is larger than the
largest that can be represented in the floating point system. Occasionally this can be avoided, as in
Example 2.9 and other examples in Section 2.3.
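Both phenomena can be provoked and detected programmatically. A small Python illustration (note that Python raises an exception for the literal expression 0.0/0.0 rather than returning NaN silently, so we construct NaN directly; a numerical library would typically produce it in place):

```python
import math
import sys

# NaN: the IEEE result of operations such as 0/0 or inf - inf.
nan = float("nan")
print(math.isnan(nan))        # True: software can detect the situation
print(nan == nan)             # False: NaN compares unequal even to itself

# Overflow: exceeding the largest double (about 1.8e308) yields inf.
big = sys.float_info.max
print(big * 2)                # inf
print(math.isinf(big * 2))    # True
```

The self-inequality of NaN is the standard portable test for it, and it is precisely what allows software to "do something graceful instead of just halting."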
We have already mentioned in Section 1.3 that a roundoff error accumulation that grows lin-
early with the number of operations in a calculation is inevitable. Our calculations will never be
so long that this sort of error would surface as a practical issue. Still, there are more pitfalls to
watch out for. One painful type of error magnification is a cancellation error, which occurs when
two nearly equal numbers are subtracted from one another. There are several examples of this in
Section 2.3. Furthermore, the discussion in this chapter suggests that such errors may consistently
arise in practice.
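A classic instance of cancellation, sketched here in Python: for small x the expression 1 − cos(x) subtracts two nearly equal numbers, while the mathematically identical form 2 sin²(x/2) does not. (This particular pair is a standard textbook illustration, not one of the examples of Section 2.3.)

```python
import math

x = 1.0e-8
# cos(x) rounds to a double extremely close to 1, and the subtraction
# then cancels essentially every significant digit.
naive = 1.0 - math.cos(x)

# The equivalent form avoids subtracting nearly equal numbers.
stable = 2.0 * math.sin(x / 2) ** 2

print(naive)    # typically 0.0 -- all digits lost to cancellation
print(stable)   # about 5e-17, close to the true value x^2/2
```

The stable form agrees with the Taylor approximation x²/2 to full precision; the naive form retains no correct digits at all.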
The rough appearance of roundoff error
Consider a smooth function g, sampled at some t and at t + h for a small (i.e., near 0) value h.
Continuity then implies that the values g(t) and g(t + h) are close to each other. But the rounding
errors in the machine representation of these two numbers are unrelated to each other: they are
random for all we know! These rounding errors are both small (that’s what η is for), but even their
signs may in fact differ. So when subtracting g(t + h) − g(t), for instance, on our way to estimate
the derivative of g at t, the relative roundoff error becomes significantly larger as cancellation error
naturally arises. This is apparent already in Figure 1.3.
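The trade-off can be watched directly. The Python sketch below (using g = exp with g(1) = g′(1), chosen here for convenience) measures the relative error of the forward difference estimate of the derivative as h shrinks:

```python
import math

def fd_error(h, t=1.0):
    """Relative error of the forward difference estimate of g'(t), g = exp,
    exploiting the fact that exp is its own derivative."""
    approx = (math.exp(t + h) - math.exp(t)) / h
    return abs(approx - math.exp(t)) / math.exp(t)

# Shrinking h reduces the discretization error (roughly h/2) but inflates
# the cancellation error, which behaves roughly like eta/h.
for h in (1e-4, 1e-8, 1e-12):
    print(h, fd_error(h))
```

For h = 1e−4 the smooth discretization error dominates; for h = 1e−12 the rough roundoff error has taken over, and the estimate is worse than at h = 1e−8. This is the behavior visible in Figure 1.3.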
Let us take a further look at the seemingly unstructured behavior of roundoff error as a smooth
function is being sampled.
Example 2.2. We evaluate g(t) = e^t at many sample points t, computing the values once in
double precision and once in single precision, and record the relative difference between the two.
Thus, the definition of the array values tt is automatically implemented in double precision,
while the instruction single when defining the array rt records the corresponding values in single
precision.
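The instruction single suggests the program is written in MATLAB; an analogous computation can be sketched in Python, rounding through the 32-bit IEEE encoding as a stand-in for single precision storage (the grid of 501 points on [0, 1] is our choice for illustration):

```python
import math
import struct

def to_single(x):
    """Round a double to the nearest single precision value
    (via the 32-bit IEEE encoding), then widen it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Sample e^t on a fine grid; treat the double precision values as exact
# and measure the relative error of the single precision values.
tt = [i / 500 for i in range(501)]
errors = [abs(to_single(math.exp(t)) - math.exp(t)) / math.exp(t) for t in tt]
print(max(errors))   # near the single precision rounding unit, about 6e-8
```

Plotting errors against tt reproduces the disorderly oscillation discussed next.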
The resulting plot is depicted in Figure 2.2. Note the disorderly, “high frequency” oscillation
of the roundoff error. This is in marked contrast to discretization errors, which are usually “smooth,”
as we have seen in Example 1.2. (Recall the straight line drop of the error in Figure 1.3 for relatively
large h, which is where the discretization error dominates.)
The output of this program indicates that, as expected, the relative error is at about the
rounding unit level. The latter is obtained by the (admittedly unappealing) function call