What Every Computer Scientist Should Know About Floating-Point Arithmetic

The setprecision() manipulator in C++ sets the number of digits used when a floating-point value is written to a stream: significant digits in the default format, or digits after the decimal point when combined with std::fixed. It is mainly useful when the value has a long or non-terminating decimal expansion, such as the repeating decimal 1/3 or an irrational number like the square root of 2. It is declared in the iomanip header. The preceding examples should not be taken to suggest that extended precision per se is harmful.
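
As a minimal sketch of how setprecision() behaves, the two helpers below format the same value with and without std::fixed. The function names format_fixed and format_significant are made up for this illustration and are not part of the standard library.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format `value` with exactly `digits` digits after the decimal point.
// std::fixed makes setprecision count digits after the point.
std::string format_fixed(double value, int digits) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(digits) << value;
    return out.str();
}

// Format `value` with at most `digits` significant digits.
// Without std::fixed, setprecision counts significant digits.
std::string format_significant(double value, int digits) {
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}
```

For example, format_fixed(123.456, 2) yields "123.46", while format_significant(123.456, 4) yields "123.5". In both cases only the printed text changes; the stored double is unaffected.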

The IEEE standard defines rounding very precisely, and it depends on the current value of the rounding modes. This sometimes conflicts with the definition of implicit rounding in type conversions or the explicit round function in languages. This means that programs which wish to use IEEE rounding can’t use the natural language primitives, and conversely the language primitives will be inefficient to implement on the ever increasing number of IEEE machines. When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation.
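
The gap between IEEE rounding and a language's own round primitive can be sketched in C++ as follows. The helper names round_in_mode and round_language are hypothetical. The sketch assumes the platform supports the standard rounding-mode macros from the cfenv header. Strictly conforming code would also need FENV_ACCESS enabled, which many compilers do not implement, so the volatile variables are only a pragmatic guard against compile-time folding.

```cpp
#include <cfenv>
#include <cmath>

// Round x to an integer under the given IEEE rounding direction
// (FE_TONEAREST, FE_UPWARD, FE_DOWNWARD or FE_TOWARDZERO), then
// restore whatever mode was active before the call.
double round_in_mode(double x, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile double input = x;  // volatile discourages constant folding
    volatile double result = std::nearbyint(input);  // honors current mode
    std::fesetround(saved);
    return result;
}

// The language primitive: std::round ignores the rounding mode and
// always rounds halfway cases away from zero.
double round_language(double x) {
    return std::round(x);
}
```

Under the default round-to-nearest-even mode, round_in_mode(2.5, FE_TONEAREST) gives 2.0, whereas round_language(2.5) gives 3.0, which shows why a program that wants IEEE semantics cannot simply call the language's round function.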

The topics discussed up to now have primarily concerned systems implications of accuracy and precision. The IEEE standard strongly recommends that users be able to specify a trap handler for each of the five classes of exceptions, and the section Trap Handlers gave some applications of user-defined trap handlers. In the case of invalid operation and division by zero exceptions, the handler should be provided with the operands; otherwise, it should be provided with the exactly rounded result. Depending on the programming language being used, the trap handler might be able to access other variables in the program as well.

The first section, Rounding Error, discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication and division. It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations. The discussion of the standard draws on the material in the section Rounding Error. The third part discusses the connections between floating-point and the design of various aspects of computer systems.
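
The two error measures mentioned above, ulps and relative error, can be sketched in a few lines of C++. The function names are invented for this illustration, and the sketch makes two simplifying assumptions: the "exact" value is itself passed as a double standing in for the true real number, and one ulp is taken to be the gap between the exact value and the next representable double of larger magnitude.

```cpp
#include <cmath>
#include <limits>

// Spacing between |x| and the next representable double above it.
double ulp_of(double x) {
    const double magnitude = std::fabs(x);
    return std::nextafter(magnitude,
                          std::numeric_limits<double>::infinity()) - magnitude;
}

// Error of `computed` measured in units in the last place of `exact`.
double error_in_ulps(double computed, double exact) {
    return std::fabs(computed - exact) / ulp_of(exact);
}

// Error of `computed` relative to the size of `exact` (exact must be nonzero).
double relative_error(double computed, double exact) {
    return std::fabs(computed - exact) / std::fabs(exact);
}
```

For instance, the double immediately above 1.0 is off by exactly 1 ulp from 1.0, and its relative error equals the machine epsilon of the double format.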

Again, the language must provide environmental parameters so that the program can determine the range and precision of the widest available format. Unfortunately, the IEEE standard does not guarantee that the same program will deliver identical results on all conforming systems. Most programs will actually produce different results on different systems for a variety of reasons. For one, most programs involve the conversion of numbers between decimal and binary formats, and the IEEE standard does not completely specify the accuracy with which such conversions must be performed. For another, many programs use elementary functions supplied by a system library, and the standard doesn’t specify these functions at all.
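
In C++, the environmental parameters are exposed through std::numeric_limits. The sketch below collects a few of them; the FormatInfo struct and the function names are hypothetical, and the values quoted afterwards assume that float and double are the IEEE 754 single and double formats, which is common but not guaranteed by the language.

```cpp
#include <limits>

// A few parameters a program might query about a floating-point format.
struct FormatInfo {
    int significand_bits;   // precision in bits, counting the implicit bit
    int max_exponent;       // one more than the largest binary exponent
    int round_trip_digits;  // decimal digits needed to survive a text round trip
};

// Read the parameters of any floating-point type T.
template <typename T>
FormatInfo describe_format() {
    using Limits = std::numeric_limits<T>;
    return FormatInfo{Limits::digits, Limits::max_exponent, Limits::max_digits10};
}

// Parameters of the widest format the implementation offers.
FormatInfo widest_format() {
    return describe_format<long double>();
}
```

On an IEEE 754 system, describe_format<double>() reports 53 significand bits and 17 round-trip digits. The result of widest_format() varies between platforms, which is one concrete reason the same program can behave differently on different conforming systems.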

A formal proof of Theorem 8, taken from Knuth page 572, appears in the section Theorem 14 and Theorem 8. The advantage of presubstitution is that it has a straightforward hardware implementation. As soon as the type of exception has been determined, it can be used to index a table which contains the desired result of the operation. Although presubstitution has some attractive attributes, the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers.

Unfortunately, when it comes to floating-point arithmetic, the goal of identical results on every conforming system is virtually impossible to achieve. The authors of the IEEE standards knew that, and they didn’t attempt to achieve it. As a result, despite nearly universal conformance to the IEEE 754 standard throughout the computer industry, programmers of portable software must continue to cope with unpredictable floating-point arithmetic. Extended precision in the IEEE standard serves a similar function. However, when using extended precision, it is important to make sure that its use is transparent to the user.

We next turn to an analysis of the formula for the area of a triangle. In order to estimate the maximum error that can occur when computing with this formula, the following fact will be needed. To summarize, instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set. Some of the implications of this for compilers are discussed in the next section. The design of almost every aspect of a computer system requires knowledge about floating-point. Computer system designers rarely get guidance from numerical analysis texts, which are typically aimed at users and writers of software, not at computer designers.
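
The value of a double-width product can be illustrated in software, assuming float and double are the IEEE 754 single and double formats. Two 24-bit significands multiply to at most 48 bits, which fits in the 53 bits of a double, so the product of two floats computed in double is exact. The function names below are invented for this sketch; it is not a description of any particular hardware instruction.

```cpp
// Exact product of two floats, held in a double.
// 24 + 24 = 48 significand bits fit within double's 53 bits.
double exact_product(float a, float b) {
    return static_cast<double>(a) * static_cast<double>(b);
}

// Rounding error committed when that product is rounded back to float.
double product_rounding_error(float a, float b) {
    const double exact = exact_product(a, b);
    const float rounded = static_cast<float>(exact);  // single rounding step
    return exact - static_cast<double>(rounded);
}
```

With a equal to 1 + 2^-23, the exact square is 1 + 2^-22 + 2^-46. Rounding to float drops the last term, so product_rounding_error(a, a) returns 2^-46, a quantity that a single-precision multiply alone could not recover.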
