Typically, single and double precision floating point systems as described above are implemented in hardware. There is also quadruple precision (128 bits), often implemented in software and thus considerably slower, for applications that require very high precision (e.g., in semiconductor simulation, numerical relativity, and astronomical calculations).
The fundamentally important exact rounding, mentioned in both Sections 2.1 and 2.2, has a rather lengthy implementation definition, in contrast to the cleanly stated requirement on its result. We will not dive deeper into this here.
Specific exercises for this section: Exercises 20–21.
Downloaded 01/27/18 to 132.174.254.159. Redistribution subject to SIAM license or copyright;
32 Chapter 2. Roundoff Errors
2.5 Exercises
0. Review questions
(a) What is a normalized floating point number and what is the purpose of normalization?
(b) A general floating point system is characterized by four values (β,t, L,U). Explain in a
few brief sentences the meaning and importance of each of these parameters.
(c) Write down the floating point representation of a given real number x in a decimal sys-
tem with t = 4, using (i) chopping and (ii) rounding.
(d) Define rounding unit (or machine precision) and explain its importance.
(e) Define overflow and underflow. Why is the former considered more damaging than the
latter?
(f) What is a cancellation error? Give an example of an application where it arises in a
natural way.
(g) What is the rounding unit for base β = 2 and t = 52 digits?
(h) Under what circumstances could nonnormalized floating point numbers be desirable?
(i) Explain the storage scheme for single precision and double precision numbers in the
IEEE standard.
1. The fraction in a single precision word has 23 bits (alas, less than half the length of the double precision word). Show that the corresponding rounding unit is approximately 6×10⁻⁸.
2. Write a MATLAB program that implements your formula and computes an approximation of f′(1.2), for h = 1e-20, 1e-19, ..., 1.
(c) Explain the difference in accuracy between your results and the results reported in Example 1.3.
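The rounding unit claimed in Exercise 1 (about 6×10⁻⁸, i.e., 2⁻²⁴) can also be checked empirically. The exercises here use MATLAB; the sketch below is in Python instead, simulating single precision by round-tripping each value through a 4-byte float with the standard struct module (that device is this sketch's assumption, not part of the exercise):

```python
import struct

def to_single(x):
    """Round a double to the nearest single precision (32-bit) float."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Halve eps until 1 + eps/2 is no longer distinguishable from 1 in single precision.
eps = 1.0
while to_single(1.0 + eps / 2) > 1.0:
    eps /= 2

# eps ends up as the machine epsilon 2^-23; the rounding unit is half of that,
# 2^-24, which is approximately 6e-8.
print(eps, eps / 2)
```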
3. (a) How many distinct positive numbers can be represented in a floating point system using
base β = 10, precision t = 2 and exponent range L =−9, U = 10?
(Assume normalized fractions and don’t worry about underflow.)
(b) How many normalized numbers are represented by the floating point system (β,t, L,U)?
Provide a formula in terms of these parameters.
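The count in part (a) can be cross-checked by brute force. A Python sketch (not MATLAB, which the book otherwise uses), enumerating in exact rational arithmetic so that equal values are not double-counted:

```python
from fractions import Fraction

beta, t, L, U = 10, 2, -9, 10

# Enumerate all normalized fractions 0.d1d2 (d1 != 0) times beta^e.
values = set()
for e in range(L, U + 1):
    for mantissa in range(beta**(t - 1), beta**t):   # 10..99, i.e., d1 != 0
        values.add(Fraction(mantissa, beta**t) * Fraction(beta)**e)

# Closed-form count: (beta - 1) * beta^(t-1) * (U - L + 1)
formula = (beta - 1) * beta**(t - 1) * (U - L + 1)
print(len(values), formula)   # both are 1800
```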
4. Suppose a computer company is developing a new floating point system for use with their
machines. They need your help in answering a few questions regarding their system. Fol-
lowing the terminology of Section 2.2, the company’s floating point system is specified by
(β,t, L,U). Assume the following:
• All floating point values are normalized (except the floating point representation of zero).
• All digits in the mantissa (i.e., fraction) of a floating point value are explicitly stored.
• The number 0 is represented by a float with a mantissa and an exponent of zeros. (Don’t
worry about special bit patterns for ±∞ and NaN.)
Here is your part:
(a) How many different nonnegative floating point values can be represented by this floating
point system?
(b) Same question for the actual choice (β,t, L,U) = (8,5,−100,100) (in decimal) which
the company is contemplating in particular.
(c) What is the approximate value (in decimal) of the largest and smallest positive numbers
that can be represented by this floating point system?
(d) What is the rounding unit?
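Under the book's conventions (normalized fraction 0.d₁d₂…dₜ with d₁ ≠ 0 and all digits stored, plus a single representation of zero, and rounding unit η = ½β^(1−t)), parts (a)–(d) reduce to short formulas. A Python sketch of those formulas, useful for sanity-checking hand derivations rather than as the worked answer:

```python
beta, t, L, U = 8, 5, -100, 100

# (a)/(b): nonnegative values = normalized positives + the single zero.
count = (beta - 1) * beta**(t - 1) * (U - L + 1) + 1

# (c): largest is 0.77777 (base 8) * 8^U; smallest positive is 0.1 (base 8) * 8^L.
largest = (1 - beta**(-t)) * float(beta)**U
smallest = float(beta)**(L - 1)

# (d): rounding unit for rounding to nearest.
eta = 0.5 * beta**(1 - t)

print(count, largest, smallest, eta)
```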
5. (a) The number 8/7 = 1.14285714285714... obviously has no exact representation in any decimal floating point system (β = 10) with finite precision t. Is there a finite floating point system (i.e., some finite integer base β and precision t) in which this number does have an exact representation? If yes, then describe such a system.
(b) Answer the same question for the irrational number π.
6. Write a MATLAB program that receives as input a number x and a parameter n and returns
x rounded to n decimal digits. Write your program so that it can handle an array as input,
returning an array of the same size in this case.
Use your program to generate numbers for Example 2.2, demonstrating the phenomenon de-
picted there without use of single precision.
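A minimal scalar version of such a rounding routine can be sketched in Python (the name round_sig and the significant-digit interpretation are this sketch's assumptions); the array case is then just an element-wise application:

```python
import math

def round_sig(x, n):
    """Round x to n significant decimal digits."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x)))   # decimal exponent of x
    return round(x, -(e - (n - 1)))      # keep n significant digits

# Element-wise over array-like input, mirroring the exercise's requirement:
data = [3.14159265, 0.000123456, 98765.0]
print([round_sig(x, 5) for x in data])
```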
7. Prove the Floating Point Representation Error Theorem on page 24.
8. Rewrite the script of Example 2.8 without any use of loops, using vectorized operations instead.
9. Suggest a way to determine approximately the rounding unit of your calculator. State the type of calculator you have and the rounding unit you have come up with. If you do not have a calculator, write a short MATLAB script to show that your algorithm works well on the standard IEEE floating point system.
10. The two functions f1 and f2 have the same values, in exact arithmetic, for any given argument values x and δ.
(d) Explain the difference in the results of the two calculations.
11. (b) Which of the two formulas is more suitable for numerical computation? Explain why, and provide a numerical example in which the difference in accuracy is evident.
12. For the following expressions, state the numerical difficulties that may occur, and rewrite the formulas in a way that is more suitable for numerical computation:
(a) , where x ≫ 1.
(b) √ , where a ≈ 0 and b ≈ 1.
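The difficulties this exercise targets are instances of catastrophic cancellation, and the standard cure is algebraic rewriting. As a generic illustration in Python (the expression √(x+1) − √x is this sketch's stand-in example, not necessarily one of the exercise's):

```python
import math

x = 1e14  # x >> 1, so sqrt(x+1) and sqrt(x) agree to many digits

naive  = math.sqrt(x + 1) - math.sqrt(x)            # subtracts nearly equal numbers
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))    # same value, no cancellation

# The true value is about 1/(2*sqrt(x)) = 5e-8; the rewritten form attains
# nearly full double precision, while the naive form keeps only a few digits.
print(naive, stable)
```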
13. Consider the linear system
with a, b > 0; a ≠ b.
(a) If a ≈ b, what is the numerical difficulty in solving this linear system?
(b) Suggest a numerically stable formula for computing z = x + y given a and b.
(c) Determine whether the following statement is true or false, and explain why:
“When a ≈ b, the problem of solving the linear system is ill-conditioned but
the problem of computing x + y is not ill-conditioned.”
14. Consider the approximation to the first derivative
f′(x) ≈ (f(x + h) − f(x))/h.
The truncation (or discretization) error for this formula is O(h). Suppose that the absolute error in evaluating the function f is bounded by ε, and let us ignore the errors generated in basic arithmetic operations.
(a) Show that the total computational error (truncation and rounding combined) is bounded by
hM/2 + 2ε/h,
where M is a bound on |f″(x)|.
behavior of the graph in Example 1.3. Make sure to explain the shape of the graph as well as the value where the apparent minimum is attained.
(d) It is not difficult to show, using Taylor expansions, that f′(x) can be approximated more accurately (in terms of truncation error) by
f′(x) ≈ (f(x + h) − f(x − h))/(2h).
For this approximation, the truncation error is O(h²). Generate a graph similar to Figure 1.3 (please generate only the solid line) for the same function and the same value of x, namely, for sin(1.2), and compare the two graphs. Explain the meaning of your results.
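For f(x) = sin(x) at x = 1.2, the two difference formulas can be compared directly against the exact derivative cos(1.2). A Python sketch (the book's own experiments use MATLAB):

```python
import math

f, x = math.sin, 1.2
exact = math.cos(x)

h = 1.0e-5
forward  = (f(x + h) - f(x)) / h              # truncation error O(h)
centered = (f(x + h) - f(x - h)) / (2 * h)    # truncation error O(h^2)

# The centered error is several orders of magnitude smaller at this h;
# for much smaller h, rounding error (~ eps/h) dominates both formulas.
print(abs(forward - exact), abs(centered - exact))
```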
15. Suppose a machine with a floating point system (β,t, L,U) = (10,8,−50,50) is used to calculate the roots of the quadratic equation
ax² + bx + c = 0,
where a, b, and c are given, real coefficients.
For each of the following, state the numerical difficulties that arise if one uses the standard formula for computing the roots. Explain how to overcome these difficulties (when possible).
16. Write a quadratic equation solver. Your MATLAB script should get a, b, c as input, and accurately compute the roots of the corresponding quadratic equation. Make sure to check end cases such as a = 0, and consider ways to avoid overflow and cancellation errors.
Implement your algorithm and demonstrate its performance on a few cases (for example, the
cases mentioned in Exercise 15). Show that your algorithm produces better results than the
standard formula for computing roots of a quadratic equation.
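One standard way to avoid the cancellation: compute the larger-magnitude root with the sign of the square root chosen so that no subtraction of nearly equal numbers occurs, then recover the other root from the product of roots, c/a. A Python sketch of the real-root case only (function and variable names are this sketch's; complex roots and overflow scaling of huge coefficients are left out):

```python
import math

def quadroots(a, b, c):
    """Roots of a*x^2 + b*x + c = 0, real-root case, avoiding cancellation."""
    if a == 0:                       # end case: the equation is linear
        return (-c / b,)
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("complex roots not handled in this sketch")
    # q is free of cancellation: b and sign(b)*sqrt(disc) share a sign.
    q = -0.5 * (b + math.copysign(math.sqrt(disc), b))
    return (q / a, c / q)

# b^2 >> |4ac|: the naive formula loses the small root to cancellation.
r1, r2 = quadroots(1.0, 1e8, 1.0)    # roots near -1e8 and -1e-8
```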
17. Write a MATLAB program that
(a) sums up 1/n for n = 1,2,...,10,000;
(b) rounds each number 1/n to 5 decimal digits and then sums them up in 5-digit decimal
arithmetic for n = 1,2,...,10,000;
(c) sums up the same rounded numbers (in 5-digit decimal arithmetic) in reverse order, i.e.,
for n = 10,000,...,2,1.
Compare the three results and explain your observations. For generating numbers with
the requested precision, you may want to do Exercise 6 first.
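Python's decimal module can emulate the 5-digit decimal arithmetic directly, an alternative to the MATLAB-plus-Exercise-6 route suggested in the text:

```python
from decimal import Decimal, getcontext

getcontext().prec = 5            # every operation rounds to 5 significant digits

N = 10000
forward = Decimal(0)
for n in range(1, N + 1):        # the division rounds 1/n to 5 digits
    forward += Decimal(1) / Decimal(n)

backward = Decimal(0)
for n in range(N, 0, -1):        # same rounded terms, summed smallest-first
    backward += Decimal(1) / Decimal(n)

# The true sum is about 9.7876; adding the small terms first, while the
# running sum is still small, loses less to rounding.
print(forward, backward)
```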
18. (a) Explain in detail how to avoid overflow when computing the ℓ2-norm of a (possibly large in size) vector.
(b) Write a MATLAB script for computing the norm of a vector in a numerically stable fashion. Demonstrate the performance of your code on a few examples.
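The usual technique for part (a) is to scale by the entry of largest magnitude before squaring, so every squared term is at most 1. A Python sketch (the name stable_norm is this sketch's):

```python
import math

def stable_norm(v):
    """l2-norm of v, avoiding overflow by scaling with the largest entry."""
    m = max(abs(x) for x in v)
    if m == 0.0:
        return 0.0
    return m * math.sqrt(sum((x / m) ** 2 for x in v))   # each |x/m| <= 1

v = [3e200, 4e200]
naive = math.sqrt(sum(x * x for x in v))   # the squares overflow to inf
print(naive, stable_norm(v))               # inf vs. about 5e200
```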
19. In the statistical treatment of data one often needs to compute the quantities
(a) Which of the two methods to calculate s² is cheaper in terms of overall computational cost? Assume x̄ has already been calculated and give the operation counts for these two options.
(b) Which of the two methods is expected to give more accurate results for s² in general?
(c) Give a small example, using a decimal system with precision t = 2 and numbers of your choice, to validate your claims.
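The accuracy gap in part (b) shows up even in double precision once the data has a large mean relative to its spread. A small Python demonstration, assuming the "two methods" are the two-pass formula Σ(xᵢ − x̄)²/(n − 1) and its one-pass rearrangement (Σxᵢ² − n x̄²)/(n − 1), the standard pair for this exercise:

```python
n = 3
x = [1e8, 1e8 + 1, 1e8 + 2]          # large mean, tiny spread
mean = sum(x) / n                     # 100000001.0, exact here

two_pass = sum((xi - mean) ** 2 for xi in x) / (n - 1)          # correct: 1.0
one_pass = (sum(xi * xi for xi in x) - n * mean**2) / (n - 1)   # ruined by cancellation

print(two_pass, one_pass)
```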
20. With exact rounding, we know that each elementary operation has a relative error which is bounded in terms of the rounding unit η; e.g., for two floating point numbers x and y, fl(x + y) = (x + y)(1 + ε), |ε| ≤ η. But is this true also for elementary functions such as sin, ln, and exponentiation?
Consider exponentiation, which is performed according to the formula
x^y = e^(y ln x)   (assuming x > 0).
21. The IEEE 754 floating point standard specifies the 128-bit word as having 15 bits for the exponent.
What is the length of the fraction? What is the rounding unit? How many significant decimal
digits does this word have?
Why is quadruple precision more than twice as accurate as double precision, which is in turn
more than twice as accurate as single precision?
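The arithmetic behind these questions is short enough to script. A Python sketch, assuming the IEEE convention of one sign bit and a hidden leading significand bit:

```python
import math

word, exp_bits = 128, 15
frac_bits = word - 1 - exp_bits          # stored fraction bits
precision = frac_bits + 1                # significand bits, counting the hidden bit

eta = 2.0 ** (-precision)                # rounding unit eta = (1/2) * 2^(1-t)
decimal_digits = precision * math.log10(2.0)   # equivalent decimal digits

print(frac_bits, eta, decimal_digits)
```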
2.6 Additional notes
A lot of thinking in the early days of modern computing went into the design of floating point sys-
tems for computers and scientific calculators. Such systems should be economical (fast execution
in hardware) on one hand, yet they should also be reliable, accurate enough, and free of unusual
exception-handling conventions on the other hand. W. Kahan was particularly instrumental in such
efforts (and received a Turing award for his contributions), especially in setting up the IEEE stan-
dard. The almost universal adoption of this standard has significantly increased both reliability and
portability of numerical codes. See Kahan’s webpage for various interesting related documents:
http://www.cs.berkeley.edu/~wkahan/.
A short, accessible textbook that discusses floating point systems in great detail is Over-
ton [58]. A comprehensive and thorough treatment of roundoff errors and many aspects of numerical
stability can be found in Higham [40].
The practical way of working with floating point arithmetic, which is to attempt to keep errors "small enough" so as not to be a bother, is hardly satisfactory from a theoretical point of view. Indeed,
what if we want to use a floating point calculation for the purpose of producing a mathematical
proof?! The nature of the latter is that a stated result should always—not just usually—hold true. A