| $NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $ | |
| SoftFloat Release 2a General Documentation | |
| John R. Hauser | |
| 1998 December 13 | |
| ------------------------------------------------------------------------------- | |
| Introduction | |
| SoftFloat is a software implementation of floating-point that conforms to | |
| the IEC/IEEE Standard for Binary Floating-Point Arithmetic. As many as four | |
| formats are supported: single precision, double precision, extended double | |
| precision, and quadruple precision. All operations required by the standard | |
| are implemented, except for conversions to and from decimal. | |
| This document gives information about the types defined and the routines | |
| implemented by SoftFloat. It does not attempt to define or explain the | |
| IEC/IEEE Floating-Point Standard. Details about the standard are available | |
| elsewhere. | |
| ------------------------------------------------------------------------------- | |
| Limitations | |
| SoftFloat is written in C and is designed to work with other C code. The | |
| SoftFloat header files assume an ISO/ANSI-style C compiler. No attempt | |
| has been made to accommodate compilers that are not ISO-conformant. In | |
| particular, the distributed header files will not be acceptable to any | |
| compiler that does not recognize function prototypes. | |
| Support for the extended double-precision and quadruple-precision formats | |
| depends on a C compiler that implements 64-bit integer arithmetic. If the | |
| largest integer format supported by the C compiler is 32 bits, SoftFloat is | |
| limited to only single and double precisions. When that is the case, all | |
| references in this document to extended double precision, quadruple | |
| precision, and 64-bit integers should be ignored. | |
| ------------------------------------------------------------------------------- | |
| Contents | |
| Introduction | |
| Limitations | |
| Contents | |
| Legal Notice | |
| Types and Functions | |
| Rounding Modes | |
| Extended Double-Precision Rounding Precision | |
| Exceptions and Exception Flags | |
| Function Details | |
| Conversion Functions | |
| Standard Arithmetic Functions | |
| Remainder Functions | |
| Round-to-Integer Functions | |
| Comparison Functions | |
| Signaling NaN Test Functions | |
| Raise-Exception Function | |
| Contact Information | |
| ------------------------------------------------------------------------------- | |
| Legal Notice | |
| SoftFloat was written by John R. Hauser. This work was made possible in | |
| part by the International Computer Science Institute, located at Suite 600, | |
| 1947 Center Street, Berkeley, California 94704. Funding was partially | |
| provided by the National Science Foundation under grant MIP-9311980. The | |
| original version of this code was written as part of a project to build | |
| a fixed-point vector processor in collaboration with the University of | |
| California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek. | |
| THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort | |
| has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT | |
| TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO | |
| PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY | |
| AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE. | |
| ------------------------------------------------------------------------------- | |
| Types and Functions | |
| When 64-bit integers are supported by the compiler, the `softfloat.h' header | |
| file defines four types: `float32' (single precision), `float64' (double | |
| precision), `floatx80' (extended double precision), and `float128' | |
| (quadruple precision). The `float32' and `float64' types are defined in | |
| terms of 32-bit and 64-bit integer types, respectively, while the `float128' | |
| type is defined as a structure of two 64-bit integers, taking into account | |
| the byte order of the particular machine being used. The `floatx80' type | |
| is defined as a structure containing one 16-bit and one 64-bit integer, with | |
| the machine's byte order again determining the order of the `high' and `low' | |
| fields. | |
| When 64-bit integers are _not_ supported by the compiler, the `softfloat.h' | |
| header file defines only two types: `float32' and `float64'. Because | |
| ISO/ANSI C guarantees at least one built-in integer type of 32 bits, | |
| the `float32' type is identified with an appropriate integer type. The | |
| `float64' type is defined as a structure of two 32-bit integers, with the | |
| machine's byte order determining the order of the fields. | |
| In either case, the types in `softfloat.h' are defined such that if a system | |
| implements the usual C `float' and `double' types according to the IEC/IEEE | |
| Standard, then the `float32' and `float64' types should be indistinguishable | |
| in memory from the native `float' and `double' types. (On the other hand, | |
| when `float32' or `float64' values are placed in processor registers by | |
| the compiler, the type of registers used may differ from those used for the | |
| native `float' and `double' types.) | |
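The memory-layout remark above can be illustrated with a small sketch. It assumes that the native `float' follows the IEC/IEEE single-precision format and that the 32-bit type is an unsigned integer; `sketch_float32' and both helper names are hypothetical stand-ins, not declarations from `softfloat.h'.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 32-bit SoftFloat type (an assumption). */
typedef uint32_t sketch_float32;

/* Copy the bit pattern of a native float into the integer-based type. */
static sketch_float32 bits_from_float(float f)
{
    sketch_float32 a;
    memcpy(&a, &f, sizeof a);
    return a;
}

/* Copy the bit pattern back into a native float. */
static float float_from_bits(sketch_float32 a)
{
    float f;
    memcpy(&f, &a, sizeof f);
    return f;
}
```

Using `memcpy' rather than a pointer cast keeps the sketch within the aliasing rules of ISO C; no numeric conversion takes place, only a copy of the bits.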
| SoftFloat implements the following arithmetic operations: | |
| -- Conversions among all the floating-point formats, and also between | |
| integers (32-bit and 64-bit) and any of the floating-point formats. | |
| -- The usual add, subtract, multiply, divide, and square root operations | |
| for all floating-point formats. | |
| -- For each format, the floating-point remainder operation defined by the | |
| IEC/IEEE Standard. | |
| -- For each floating-point format, a ``round to integer'' operation that | |
| rounds to the nearest integer value in the same format. (The floating- | |
| point formats can hold integer values, of course.) | |
| -- Comparisons between two values in the same floating-point format. | |
| The only functions required by the IEC/IEEE Standard that are not provided | |
| are conversions to and from decimal. | |
| ------------------------------------------------------------------------------- | |
| Rounding Modes | |
| All four rounding modes prescribed by the IEC/IEEE Standard are implemented | |
| for all operations that require rounding. The rounding mode is selected | |
| by the global variable `float_rounding_mode'. This variable may be set | |
| to one of the values `float_round_nearest_even', `float_round_to_zero', | |
| `float_round_down', or `float_round_up'. The rounding mode is initialized | |
| to nearest/even. | |
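As a rough illustration of how the four modes differ, the sketch below rounds an exact multiple of one quarter (the value q4/4) to an integer. It does not call SoftFloat; the enumeration names and the function `round_quarters' are hypothetical and exist only for this example.

```c
/* Hypothetical mode names for this sketch; not the SoftFloat constants. */
enum sketch_round {
    SKETCH_NEAREST_EVEN,
    SKETCH_TO_ZERO,
    SKETCH_DOWN,
    SKETCH_UP
};

/* Round the exact value q4/4 to an integer under the given mode. */
static long round_quarters(long q4, enum sketch_round mode)
{
    long fl = q4 / 4;           /* C division truncates toward zero */
    long rem;

    if (q4 % 4 < 0) fl--;       /* adjust to the floor for negative values */
    rem = q4 - 4 * fl;          /* discarded part in quarters, 0..3 */

    switch (mode) {
    case SKETCH_DOWN:           /* toward minus infinity */
        return fl;
    case SKETCH_UP:             /* toward plus infinity */
        return rem ? fl + 1 : fl;
    case SKETCH_TO_ZERO:        /* floor if positive, ceiling if negative */
        return (q4 < 0 && rem) ? fl + 1 : fl;
    default:                    /* nearest, ties to the even neighbor */
        if (rem < 2) return fl;
        if (rem > 2) return fl + 1;
        return (fl % 2 == 0) ? fl : fl + 1;
    }
}
```

For example, 2.5 (q4 = 10) becomes 2 under nearest/even, toward zero, and down, but 3 under up; -2.5 (q4 = -10) becomes -3 only under down.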
| ------------------------------------------------------------------------------- | |
| Extended Double-Precision Rounding Precision | |
| For extended double precision (`floatx80') only, the rounding precision | |
| of the standard arithmetic operations is controlled by the global variable | |
| `floatx80_rounding_precision'. The operations affected are: | |
| floatx80_add floatx80_sub floatx80_mul floatx80_div floatx80_sqrt | |
| When `floatx80_rounding_precision' is set to its default value of 80, these | |
| operations are rounded (as usual) to the full precision of the extended | |
| double-precision format. Setting `floatx80_rounding_precision' to 32 | |
| or to 64 causes the operations listed to be rounded to reduced precision | |
| equivalent to single precision (`float32') or to double precision | |
| (`float64'), respectively. When rounding to reduced precision, additional | |
| bits in the result significand beyond the rounding point are set to zero. | |
| The consequences of setting `floatx80_rounding_precision' to a value other | |
| than 32, 64, or 80 are not specified. Operations other than the ones listed | |
| above are not affected by `floatx80_rounding_precision'. | |
| ------------------------------------------------------------------------------- | |
| Exceptions and Exception Flags | |
| All five exception flags required by the IEC/IEEE Standard are | |
| implemented. Each flag is stored as a unique bit in the global variable | |
| `float_exception_flags'. The positions of the exception flag bits within | |
| this variable are determined by the bit masks `float_flag_inexact', | |
| `float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and | |
| `float_flag_invalid'. The exception flags variable is initialized to all 0, | |
| meaning no exceptions. | |
| An individual exception flag can be cleared with the statement | |
| float_exception_flags &= ~ float_flag_<exception>; | |
| where `<exception>' is the appropriate name. To raise a floating-point | |
| exception, the SoftFloat function `float_raise' should be used (see below). | |
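The flag-handling pattern described above might look like the following sketch. Everything prefixed `demo_' is a hypothetical stand-in: the real variable and masks come from `softfloat.h', the mask values there are platform-specific, and the real `float_raise' may also trap, which this stand-in does not model.

```c
/* Placeholder masks for the sketch; the actual values are defined by
   softfloat.h and may differ from these. */
enum {
    demo_flag_inexact   = 1,
    demo_flag_underflow = 2,
    demo_flag_overflow  = 4,
    demo_flag_divbyzero = 8,
    demo_flag_invalid   = 16
};

/* Stand-in for the global exception flags variable; starts at all 0. */
static int demo_exception_flags = 0;

/* Set the flags in mask (the real float_raise may do more than this). */
static void demo_raise(int mask) { demo_exception_flags |= mask; }

/* Return 1 if any flag in mask is currently set, 0 otherwise. */
static int demo_test(int mask) { return (demo_exception_flags & mask) != 0; }

/* Clear the flags in mask, using the same form as the statement above. */
static void demo_clear(int mask) { demo_exception_flags &= ~mask; }
```

Because the flags are sticky, a typical use is to clear the flags of interest, run a sequence of operations, and then test the flags once at the end.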
| In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess | |
| for underflow either before or after rounding. The choice is made by | |
| the global variable `float_detect_tininess', which can be set to either | |
| `float_tininess_before_rounding' or `float_tininess_after_rounding'. | |
| Detecting tininess after rounding is better because it results in fewer | |
| spurious underflow signals. The other option is provided for compatibility | |
| with some systems. Like most systems, SoftFloat always detects loss of | |
| accuracy for underflow as an inexact result. | |
| ------------------------------------------------------------------------------- | |
| Function Details | |
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | |
| Conversion Functions | |
| All conversions among the floating-point formats are supported, as are all | |
| conversions between a floating-point format and 32-bit and 64-bit signed | |
| integers. The complete set of conversion functions is: | |
| int32_to_float32 int64_to_float32 | |
| int32_to_float64 int64_to_float64 | |
| int32_to_floatx80 int64_to_floatx80 | |
| int32_to_float128 int64_to_float128 | |
| float32_to_int32 float32_to_int64 | |
| float64_to_int32 float64_to_int64 | |
| floatx80_to_int32 floatx80_to_int64 | |
| float128_to_int32 float128_to_int64 | |
| float32_to_float64 float32_to_floatx80 float32_to_float128 | |
| float64_to_float32 float64_to_floatx80 float64_to_float128 | |
| floatx80_to_float32 floatx80_to_float64 floatx80_to_float128 | |
| float128_to_float32 float128_to_float64 float128_to_floatx80 | |
| Each conversion function takes one operand of the appropriate type and | |
| returns one result. Conversions from a smaller to a larger floating-point | |
| format are always exact and so require no rounding. Conversions from 32-bit | |
| integers to double precision and larger formats are also exact, and likewise | |
| for conversions from 64-bit integers to extended double and quadruple | |
| precisions. | |
| Conversions from floating-point to integer raise the invalid exception if | |
| the source value cannot be rounded to a representable integer of the desired | |
| size (32 or 64 bits). If the floating-point operand is a NaN, the largest | |
| positive integer is returned. Otherwise, if the conversion overflows, the | |
| largest integer with the same sign as the operand is returned. | |
| On conversions to integer, if the floating-point operand is not already an | |
| integer value, the operand is rounded according to the current rounding | |
| mode as specified by `float_rounding_mode'. Because C (and perhaps other | |
| languages) requires that conversions to integers be rounded toward zero, the | |
| following functions are provided for improved speed and convenience: | |
| float32_to_int32_round_to_zero float32_to_int64_round_to_zero | |
| float64_to_int32_round_to_zero float64_to_int64_round_to_zero | |
| floatx80_to_int32_round_to_zero floatx80_to_int64_round_to_zero | |
| float128_to_int32_round_to_zero float128_to_int64_round_to_zero | |
| These variant functions ignore `float_rounding_mode' and always round toward | |
| zero. | |
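The result rules for an out-of-range or NaN operand can be sketched for a round-toward-zero conversion to a 32-bit integer. The sketch uses the native `double' type in place of a SoftFloat type and reports the invalid exception through an output parameter; the function name and that parameter are hypothetical, and this is not the SoftFloat implementation.

```c
#include <math.h>     /* isnan, NAN */
#include <stdint.h>

/* Convert x to a 32-bit integer, rounding toward zero.  *invalid is set
   to 1 when the invalid exception would be raised, and to 0 otherwise. */
static int32_t sketch_to_int32_round_to_zero(double x, int *invalid)
{
    *invalid = 0;
    if (isnan(x)) {                 /* NaN: largest positive integer */
        *invalid = 1;
        return INT32_MAX;
    }
    if (x >= 2147483648.0) {        /* overflow on the positive side */
        *invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {       /* overflow on the negative side */
        *invalid = 1;
        return INT32_MIN;
    }
    return (int32_t)x;              /* in range: the cast truncates */
}
```

Note that a value such as -2147483648.5 is still in range, because truncation toward zero yields the representable integer -2147483648.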
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | |
| Standard Arithmetic Functions | |
| The following standard arithmetic functions are provided: | |
| float32_add float32_sub float32_mul float32_div float32_sqrt | |
| float64_add float64_sub float64_mul float64_div float64_sqrt | |
| floatx80_add floatx80_sub floatx80_mul floatx80_div floatx80_sqrt | |
| float128_add float128_sub float128_mul float128_div float128_sqrt | |
| Each function takes two operands, except for `sqrt' which takes only one. | |
| The operands and result are all of the same type. | |
| Rounding of the extended double-precision (`floatx80') functions is affected | |
| by the `floatx80_rounding_precision' variable, as explained above in the | |
| section _Extended_Double-Precision_Rounding_Precision_. | |
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | |
| Remainder Functions | |
| For each format, SoftFloat implements the remainder function according to | |
| the IEC/IEEE Standard. The remainder functions are: | |
| float32_rem | |
| float64_rem | |
| floatx80_rem | |
| float128_rem | |
| Each remainder function takes two operands. The operands and result are all | |
| of the same type. Given operands x and y, the remainder functions return | |
| the value x - n*y, where n is the integer closest to x/y. If x/y is exactly | |
| halfway between two integers, n is the even integer closest to x/y. The | |
| remainder functions are always exact and so require no rounding. | |
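The definition x - n*y, with ties going to the even n, can be checked on plain integers, where every intermediate value is exact. The function below is only an illustration of the definition; its name is hypothetical, it is not how SoftFloat computes remainders, and it assumes y is nonzero and that the operands are small enough that 2*|r| does not overflow.

```c
/* Return x - n*y, where n is the integer nearest to x/y and an exact
   tie selects the even n.  The caller must ensure y != 0. */
static long sketch_rem(long x, long y)
{
    long q  = x / y;                /* quotient truncated toward zero */
    long r  = x % y;                /* x - q*y; same sign as x */
    long ar = r < 0 ? -r : r;
    long ay = y < 0 ? -y : y;

    /* Step n one further away from zero when q is not the nearest
       integer, or when x/y is a tie and q is odd. */
    if (2 * ar > ay || (2 * ar == ay && q % 2 != 0)) {
        if ((x < 0) != (y < 0))
            r += y;                 /* x/y is negative: n = q - 1 */
        else
            r -= y;                 /* x/y is positive: n = q + 1 */
    }
    return r;
}
```

For example, 5 rem 2 is 1 (x/y = 2.5, so n = 2), while 7 rem 2 is -1 (x/y = 3.5, so n = 4). Unlike the C `%' operator, the result can have the opposite sign from x.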
| Depending on the relative magnitudes of the operands, the remainder | |
| functions can take considerably longer to execute than the other SoftFloat | |
| functions. This is inherent in the remainder operation itself and is not a | |
| flaw in the SoftFloat implementation. | |
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | |
| Round-to-Integer Functions | |
| For each format, SoftFloat implements the round-to-integer function | |
| specified by the IEC/IEEE Standard. The functions are: | |
| float32_round_to_int | |
| float64_round_to_int | |
| floatx80_round_to_int | |
| float128_round_to_int | |
| Each function takes a single floating-point operand and returns a result of | |
| the same type. (Note that the result is not an integer type.) The operand | |
| is rounded to an exact integer according to the current rounding mode, and | |
| the resulting integer value is returned in the same floating-point format. | |
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | |
| Comparison Functions | |
| The following floating-point comparison functions are provided: | |
| float32_eq float32_le float32_lt | |
| float64_eq float64_le float64_lt | |
| floatx80_eq floatx80_le floatx80_lt | |
| float128_eq float128_le float128_lt | |
| Each function takes two operands of the same type and returns a 1 or 0 | |
| representing either _true_ or _false_. The abbreviation `eq' stands for | |
| ``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands | |
| for ``less than'' (<). | |
| The standard greater-than (>), greater-than-or-equal (>=), and not-equal | |
| (!=) functions are easily obtained using the functions provided. The | |
| not-equal function is just the logical complement of the equal function. | |
| The greater-than-or-equal function is identical to the less-than-or-equal | |
| function with the operands reversed; and the greater-than function can be | |
| obtained from the less-than function in the same way. | |
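The derivations just described can be written as thin wrappers. In this sketch the three `sk_' base functions use native `float' operators as stand-ins for a format's `eq', `le', and `lt' functions; all names are hypothetical, and the stand-ins do not model the exception flags.

```c
#include <math.h>   /* NAN, for trying the wrappers on unordered operands */

/* Stand-ins for the three provided comparisons of one format. */
static int sk_eq(float a, float b) { return a == b; }
static int sk_le(float a, float b) { return a <= b; }
static int sk_lt(float a, float b) { return a < b; }

/* Not-equal is the logical complement of equal. */
static int sk_ne(float a, float b) { return !sk_eq(a, b); }

/* Greater-than-or-equal is less-than-or-equal with operands reversed. */
static int sk_ge(float a, float b) { return sk_le(b, a); }

/* Greater-than is less-than with operands reversed. */
static int sk_gt(float a, float b) { return sk_lt(b, a); }
```

Reversing the operands, rather than complementing the result, matters when an operand is a NaN: greater-than must then be false, whereas the complement of less-than-or-equal would be true.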
| The IEC/IEEE Standard specifies that the less-than-or-equal and less-than | |
| functions raise the invalid exception if either input is any kind of NaN. | |
| The equal functions, on the other hand, are defined not to raise the invalid | |
| exception on quiet NaNs. For completeness, SoftFloat provides the following | |
| additional functions: | |
| float32_eq_signaling float32_le_quiet float32_lt_quiet | |
| float64_eq_signaling float64_le_quiet float64_lt_quiet | |
| floatx80_eq_signaling floatx80_le_quiet floatx80_lt_quiet | |
| float128_eq_signaling float128_le_quiet float128_lt_quiet | |
| The `signaling' equal functions are identical to the standard functions | |
| except that the invalid exception is raised for any NaN input. Likewise, | |
| the `quiet' comparison functions are identical to their counterparts except | |
| that the invalid exception is not raised for quiet NaNs. | |
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | |
| Signaling NaN Test Functions | |
| The following functions test whether a floating-point value is a signaling | |
| NaN: | |
| float32_is_signaling_nan | |
| float64_is_signaling_nan | |
| floatx80_is_signaling_nan | |
| float128_is_signaling_nan | |
| The functions take one operand and return 1 if the operand is a signaling | |
| NaN and 0 otherwise. | |
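For single precision, such a test might be sketched as below. The sketch assumes the common convention in which a NaN is signaling when the most significant fraction bit is 0; some processors use the opposite convention, so this bit test is an assumption rather than something stated in this document. The function name is hypothetical.

```c
#include <stdint.h>

/* Return 1 if the single-precision bit pattern a is a signaling NaN,
   assuming that a clear most significant fraction bit marks signaling. */
static int sketch_float32_is_signaling_nan(uint32_t a)
{
    /* Bits 30..22: exponent all ones, top fraction bit clear. */
    int exp_ones_top_clear = ((a >> 22) & 0x1FF) == 0x1FE;
    /* Remaining fraction bits must not all be zero (else infinity). */
    int rest_nonzero = (a & 0x003FFFFFu) != 0;
    return exp_ones_top_clear && rest_nonzero;
}
```

The second condition is what separates a signaling NaN from an infinity, whose exponent field is also all ones but whose fraction is zero.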
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | |
| Raise-Exception Function | |
| SoftFloat provides a function for raising floating-point exceptions: | |
| float_raise | |
| The function takes a mask indicating the set of exceptions to raise. No | |
| result is returned. In addition to setting the specified exception flags, | |
| this function may cause a trap or abort appropriate for the current system. | |
| ------------------------------------------------------------------------------- | |
| Contact Information | |
| At the time of this writing, the most up-to-date information about | |
| SoftFloat and the latest release can be found at the Web page | |
| `http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'. | |