The IEEE-754 floating-point standard
The IEEE-754 floating-point standard is a standard for representing and manipulating floating-point quantities that is followed by all modern computer systems. It defines several standard representations of floating-point numbers, all of which have the following basic pattern (the specific layout here is for 32-bit floats):
    bit  31 30    23 22                    0
          S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
The bit numbers are counted from the least-significant bit. The first (most-significant) bit is the sign (0 for positive, 1 for negative). The following 8 bits are the exponent in excess-127 binary notation; this means that the binary pattern 01111111 = 127 represents an exponent of 0, 10000000 = 128 represents 1, 01111110 = 126 represents -1, and so forth. The mantissa fills the remaining 23 bits; because its leading 1 is stripped off as described above, it effectively carries 24 bits of precision.
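For example, consider the value 6 from the table below, stored as 0 10000001 10000000000000000000000: the sign bit is 0 (positive), the exponent field 10000001 = 129 represents 129 - 127 = 2, and the mantissa field 1000...0 with its implicit leading 1 restored is 1.1 in binary = 1.5, giving 1.5 * 2^2 = 6.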
Certain numbers have a special representation. Because 0 cannot be represented in the standard form (there is no 1 before the decimal point), it is given the special representation 0 00000000 00000000000000000000000. (There is also a -0 = 1 00000000 00000000000000000000000, which looks equal to +0 but prints differently.) Numbers with an exponent field of 11111111 = 255 (which would correspond to 2^128) represent non-numeric quantities such as "not a number" (NaN), returned by operations like (0.0/0.0), and positive or negative infinity. A table of some typical floating-point numbers (generated by the program float.c) is given below:
        0 = 0                     = 0 00000000 00000000000000000000000
       -0 = -0                    = 1 00000000 00000000000000000000000
    0.125 = 0.125                 = 0 01111100 00000000000000000000000
     0.25 = 0.25                  = 0 01111101 00000000000000000000000
      0.5 = 0.5                   = 0 01111110 00000000000000000000000
        1 = 1                     = 0 01111111 00000000000000000000000
        2 = 2                     = 0 10000000 00000000000000000000000
        4 = 4                     = 0 10000001 00000000000000000000000
        8 = 8                     = 0 10000010 00000000000000000000000
    0.375 = 0.375                 = 0 01111101 10000000000000000000000
     0.75 = 0.75                  = 0 01111110 10000000000000000000000
      1.5 = 1.5                   = 0 01111111 10000000000000000000000
        3 = 3                     = 0 10000000 10000000000000000000000
        6 = 6                     = 0 10000001 10000000000000000000000
      0.1 = 0.10000000149011612   = 0 01111011 10011001100110011001101
      0.2 = 0.20000000298023224   = 0 01111100 10011001100110011001101
      0.4 = 0.40000000596046448   = 0 01111101 10011001100110011001101
      0.8 = 0.80000001192092896   = 0 01111110 10011001100110011001101
    1e+12 = 999999995904          = 0 10100110 11010001101010010100101
    1e+24 = 1.0000000138484279e+24 = 0 11001110 10100111100001000011100
    1e+36 = 9.9999996169031625e+35 = 0 11110110 10000001001011111001110
      inf = inf                   = 0 11111111 00000000000000000000000
     -inf = -inf                  = 1 11111111 00000000000000000000000
      nan = nan                   = 0 11111111 10000000000000000000000
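A minimal sketch of a program in the spirit of float.c (this is just an illustration of how to extract the fields, not necessarily the original program; it assumes float is a 32-bit IEEE-754 value):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Print the sign, exponent, and mantissa fields of a 32-bit float. */
    static void print_float_bits(float f)
    {
        uint32_t bits;

        /* Copy the raw bit pattern into an integer; assumes float is a
           32-bit IEEE-754 value, as on essentially all modern systems. */
        memcpy(&bits, &f, sizeof(bits));

        printf("%g = %u ", f, (unsigned) (bits >> 31));   /* value and sign bit */
        for (int i = 30; i >= 23; i--)                     /* 8 exponent bits */
            putchar((bits >> i) & 1 ? '1' : '0');
        putchar(' ');
        for (int i = 22; i >= 0; i--)                      /* 23 mantissa bits */
            putchar((bits >> i) & 1 ? '1' : '0');
        putchar('\n');
    }

    int main(void)
    {
        print_float_bits(0.1f);
        print_float_bits(6.0f);
        print_float_bits(1e12f);
        return 0;
    }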
What this means in practice is that a 32-bit floating-point value (e.g. a float) can represent any number whose magnitude lies between 1.17549435e-38 and 3.40282347e+38, where the e separates the (base-10) exponent. Operations that would create a smaller value will underflow to 0 (gradually: IEEE 754 allows "denormalized" floating-point numbers with reduced precision for very small values), and operations that would create a larger value will produce inf or -inf instead.
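As a rough sketch of overflow and gradual underflow in action (the exact output may vary slightly with the platform and compiler):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        float big = FLT_MAX;    /* about 3.40282347e+38 */
        float tiny = FLT_MIN;   /* about 1.17549435e-38, smallest normalized float */

        float overflowed = big * 2.0f;   /* too big: becomes inf */
        float denormal = tiny / 2.0f;    /* below FLT_MIN: denormalized, reduced precision */
        float gone = tiny / 1e20f;       /* far too small: underflows to 0 */

        printf("FLT_MAX * 2    = %g\n", overflowed);
        printf("FLT_MIN / 2    = %g\n", denormal);
        printf("FLT_MIN / 1e20 = %g\n", gone);
        return 0;
    }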
For a 64-bit double, both the exponent and the mantissa are larger; this gives a range from 2.2250738585072014e-308 to 1.7976931348623157e+308, with similar behavior on underflow and overflow.
Intel
processors internally use an even larger 80-bit floating-point format
for all operations. Unless you declare your variables as long double, this should not be visible to you from C
except that some operations that might otherwise produce overflow
errors will not do so, provided all the variables involved sit in
registers (typically the case only for local variables and function
parameters).
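For example, the following sketch assumes an x86-style 80-bit long double (compile with -lm if needed); on platforms where long double is the same size as double, both exponentials come out as inf:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        printf("sizeof(float)       = %zu\n", sizeof(float));
        printf("sizeof(double)      = %zu\n", sizeof(double));
        printf("sizeof(long double) = %zu\n", sizeof(long double));

        /* exp(1000) overflows a 64-bit double, but fits in an 80-bit
           extended-precision long double (whose maximum is about 1.19e+4932). */
        printf("exp(1000.0)   = %g\n", exp(1000.0));     /* inf */
        printf("expl(1000.0L) = %Lg\n", expl(1000.0L));  /* roughly 1.97e+434 with 80-bit long double */
        return 0;
    }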
Error
In general, floating-point numbers are not exact: they are likely to contain round-off error because of the truncation of the mantissa to a fixed number of bits. This is particularly noticeable for large values (e.g. 1e+12 in the table above), but can also be seen in fractions with values that aren't powers of 2 in the denominator (e.g. 0.1).
Round-off error is often invisible with the default float output
formats, since they produce fewer digits than are stored internally, but
can accumulate over time, particularly if you subtract floating-point
quantities with values that are close (this wipes out the mantissa
without wiping out the error, making the error much larger relative to
the number that remains).
The easiest way to avoid accumulating error is to use high-precision floating-point numbers (this means using double instead of float). On modern CPUs there is little or no time penalty for doing so, although storing doubles instead of floats will take twice as much space in memory.
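A quick sketch of the difference (the exact sums will vary, but the float result is visibly further from the true answer of 1000000):

    #include <stdio.h>

    int main(void)
    {
        float sum_f = 0.0f;
        double sum_d = 0.0;

        /* Add 0.1 ten million times; the exact answer would be 1000000. */
        for (int i = 0; i < 10000000; i++) {
            sum_f += 0.1f;
            sum_d += 0.1;
        }

        printf("float  sum = %f\n", sum_f);   /* off by a large, visible amount */
        printf("double sum = %f\n", sum_d);   /* very close to 1000000 */
        return 0;
    }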
Note that a consequence of the internal structure of IEEE 754 floating-point numbers is that small integers and fractions with small numerators and power-of-2 denominators can be represented exactly; indeed, the IEEE 754 standard carefully defines floating-point operations so that arithmetic on such exact integers will give the same answers as integer arithmetic would (except, of course, for division that produces a remainder). This fact can sometimes be exploited to get higher precision on integer values than is available from the standard integer types; for example, a double can represent any integer between -2^53 and 2^53 exactly, which is a much wider range than the values from -2^31 to 2^31-1 that fit in a 32-bit int or long. (A 64-bit long long does better.) So double should be considered for applications where large precise integers are needed (such as calculating the net worth in pennies of a billionaire).
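A small sketch of this exactness and where it runs out:

    #include <stdio.h>

    int main(void)
    {
        double big = 9007199254740992.0;   /* 2^53 */

        printf("%.0f\n", big);             /* 9007199254740992, exact */
        printf("%.0f\n", big - 1.0);       /* 9007199254740991, still exact */

        /* 2^53 + 1 is not representable, so this prints 9007199254740992 again. */
        printf("%.0f\n", big + 1.0);
        return 0;
    }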
One consequence of round-off error is that it is very difficult to test floating-point numbers for equality, unless you are sure you have an exact value as described above. It is generally not the case, for example, that (0.1+0.1+0.1) == 0.3 in C. This can produce odd results if you try writing something like for(f = 0.0; f <= 0.3; f += 0.1): it will be hard to predict in advance whether the loop body will be executed with f = 0.3 or not. (Even more hilarity ensues if you write for(f = 0.0; f != 0.3; f += 0.1), which after not quite hitting 0.3 exactly keeps looping for much longer than I am willing to wait to see it stop, but which I suspect will eventually converge to some constant value of f large enough that adding 0.1 to it has no effect.) Most of the time when you are tempted to test floats for equality, you are better off testing whether one lies within a small distance of the other, e.g. by testing fabs(x-y) <= fabs(EPSILON * y), where EPSILON is usually some application-dependent tolerance. This isn't quite the same as equality (for example, it isn't transitive), but it is usually closer to what you want.
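A minimal sketch of such a comparison (the name approx_equal and the value of EPSILON are just illustrative):

    #include <stdio.h>
    #include <math.h>

    #define EPSILON 1e-6   /* illustrative; the right tolerance is application-dependent */

    /* Nonzero if x and y are within a small relative distance of each other.
       Note that this relation is not transitive. */
    static int approx_equal(double x, double y)
    {
        return fabs(x - y) <= fabs(EPSILON * y);
    }

    int main(void)
    {
        double sum = 0.1 + 0.1 + 0.1;

        printf("(0.1+0.1+0.1) == 0.3 is %d\n", sum == 0.3);               /* typically 0 */
        printf("approx_equal(sum, 0.3) is %d\n", approx_equal(sum, 0.3)); /* 1 */
        return 0;
    }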
For more details, see http://www.cprogramming.com/tutorial/floating_point/understanding_floating_point_representation.html