floating point - Is 3*x+x always exact?

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

Assuming strict IEEE 754 (no excess precision) and round to nearest even mode, is 3*x+x always == 4*x (and thus exact in absence of overflow) and why?

I was not able to exhibit a counter-example, so I went into a lengthy discussion of every possible trailing bit pattern abc and rounding case, but I feel like I could have missed a case, and also missed a simpler demonstration...

I also have an intuition that this could be extended to (2^n-1) x + x == 2^n x, and testing every combination of trailing bits in that case is not an option.

We should have (2^n - 1) x == 2^n x - x by property of IEEE 754 as long as n <= 54, but y-x+x == y is not generally true...
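For anyone who wants to hunt for counter-examples empirically, here is a minimal randomized sketch. It assumes Python floats are IEEE 754 binary64 with round-to-nearest-even (true on common platforms); the helper name `find_counterexample` is hypothetical and the sampling scheme is just one reasonable choice. The last line illustrates the remark that y-x+x == y is not generally true.

```python
import random

def find_counterexample(n, trials=100_000, seed=0):
    """Return an x with (2**n - 1)*x + x != 2**n * x, or None if none is found.

    Hypothetical helper: samples random normal doubles, far from overflow.
    """
    rng = random.Random(seed)
    f = float(2**n - 1)              # exactly representable for n <= 53
    scale = 2.0 ** n                 # power of two, so scale * x is exact
    for _ in range(trials):
        # Random significand in [1, 2] times a random power of two.
        x = rng.uniform(1.0, 2.0) * 2.0 ** rng.randint(-500, 500)
        if f * x + x != scale * x:
            return x
    return None

# In contrast, y - x + x == y can fail: 1.0 is absorbed when subtracted from 1e17.
absorbed = (1.0 - 1e17) + 1e17
```

Calling `find_counterexample(2)` searches the 3*x+x case; a `None` result is only evidence, not a proof.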

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:07:49+0000

In the following, math shown in code format is computed with IEEE 754 in round-to-nearest mode, and math not in code format is exact.

Let p be the number of bits in the significand.

Let f be the factor 2ⁿ-1 for a positive integer n, with f exactly representable (n ≤ p).

Let U(x) be the ULP of x. For normal values, U(x) ≤ 2^(1-p)|x|.
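The ULP bound can be checked numerically. The sketch below assumes Python 3.9+ (for `math.ulp`) and binary64 floats, so p = 53; the function name `ulp_bound_holds` is made up for illustration. Exact rational arithmetic is used for the comparison so that the check itself introduces no rounding.

```python
import math
import sys
from fractions import Fraction

P = sys.float_info.mant_dig          # significand bits; 53 for binary64

def ulp_bound_holds(x):
    """Check U(x) <= 2^(1-p) * |x| exactly, using rationals."""
    ulp = Fraction(math.ulp(x))      # math.ulp uses the magnitude of x
    bound = Fraction(2) ** (1 - P) * abs(Fraction(x))
    return ulp <= bound
```

The bound holds with equality at powers of two such as 1.0, and it fails for subnormal inputs such as 5e-324, which is why the statement is restricted to normal values.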

Let t be `f*x`. If `f*x` is subnormal, then it is exactly fx. If it is normal, then t = fx+e for some |e| ≤ ½U(fx) ≤ 2^(-p)|fx|. Note that if |e| is exactly half an ULP, then it must equal the lowest bit of x that is set (since otherwise e would have more than one bit set and could not be half of an ULP).

Substituting for f, t = (2ⁿ-1)x+e.

t+x = (2ⁿ-1)x+e+x = 2ⁿx+e.

Consider `t+x`. By IEEE-754 requirements of round-to-nearest, this must be within half an ULP of the exact sum t+x, which we know to be 2ⁿx+e. Clearly 2ⁿx is representable (barring overflow), and |e| ≤ ½U(fx) ≤ ½U(2ⁿx). Therefore `t+x` must be 2ⁿx unless |e| is exactly half an ULP and the low bit of x’s significand is odd (since an even low bit wins the tie and gives 2ⁿx).

If n is 1, then f is 1, and e is 0. If 2 ≤ n, then |e| ≤ ¼U(2ⁿx) < ½U(2ⁿx). So a case where |e| is half an ULP and x’s low bit is odd does not occur.

Therefore `t+x` must be 2ⁿx. (Overflow and NaN left as an exercise for the reader.)
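The quantities in the argument above can be inspected for a concrete x with exact rational arithmetic. This is only an illustrative sketch, assuming binary64 Python floats and Python 3.9+ for `math.ulp`; the function name `check_proof_steps` and the dictionary keys are made up here.

```python
import math
from fractions import Fraction

def check_proof_steps(x, n):
    """Check the steps of the argument for one normal x (no overflow assumed)."""
    f = 2**n - 1
    scaled = float(2**n) * x                 # 2^n * x, exact barring overflow
    t = float(f) * x                         # computed (rounded) product f*x
    e = Fraction(t) - f * Fraction(x)        # exact rounding error, t = fx + e
    half_ulp = Fraction(math.ulp(scaled)) / 2
    return {
        # Exact identity: t + x = 2^n x + e.
        "identity": Fraction(t) + Fraction(x) == Fraction(scaled) + e,
        # |e| <= (1/2) U(2^n x), so the exact sum is within half an ULP of 2^n x.
        "error_within_half_ulp": abs(e) <= half_ulp,
        # The computed sum therefore rounds to 2^n x.
        "computed_sum_equals_scaled": t + x == scaled,
    }
```

For example, `check_proof_steps(0.1, 2)` examines 3*0.1+0.1, where the product 3*0.1 is itself inexact.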

Additionally, I tested exhaustively for IEEE-754 32-bit binary floating-point.
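The test program itself is not shown in the answer. As a rough sketch of how such a check over binary32 could look (not the author's code), the following emulates single precision in pure Python by rounding binary64 results through `struct`. The names `to_f32`, `bits_to_f32`, and `check_range` are hypothetical. The emulation relies on the assumption that, for small n, the product and the sum are exact in binary64, so each `to_f32` call performs a single correct rounding.

```python
import math
import struct

def to_f32(v):
    """Round a binary64 value to the nearest binary32 value (ties to even)."""
    try:
        return struct.unpack("<f", struct.pack("<f", v))[0]
    except OverflowError:
        # struct refuses finite values that overflow binary32; IEEE gives infinity.
        return math.copysign(math.inf, v)

def bits_to_f32(bits):
    """Reinterpret a 32-bit pattern as a binary32 value (held in a Python float)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def check_range(start, stop, step=1, n=2):
    """Return the bit patterns whose value x violates (2^n-1)*x + x == 2^n*x."""
    f = float(2**n - 1)
    failures = []
    for bits in range(start, stop, step):
        x = bits_to_f32(bits)
        if math.isnan(x) or math.isinf(x):
            continue                          # NaN and infinity are out of scope
        expected = to_f32(float(2**n) * x)
        if math.isinf(expected):
            continue                          # overflow is excluded by the question
        t = to_f32(f * x)                     # single-precision f*x
        if to_f32(t + x) != expected:         # single-precision t+x
            failures.append(bits)
    return failures
```

A full run would be `check_range(0, 2**32)`, which is slow in pure Python; a strided call such as `check_range(0, 2**32, step=1_000_003)` samples the whole range, including subnormals and negative values, in well under a second.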
