c - Fast float quantize, scaled by precision?

Question

Welcome To Ask or Share your Answers For Others

c - Fast float quantize, scaled by precision?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c - Fast float quantize, scaled by precision?

Since float precision reduces for larger values, in some cases it may be useful to quantize the value based on its size - instead of quantizing by an absolute value.

A naive approach could be to detect the precision and scale it up:

float quantize(float value, float quantize_scale) {
    float factor = (nextafterf(fabsf(value)) - fabsf(value)) * quantize_scale;
    return floorf((value / factor) + 0.5f) * factor;
}

However this seems too heavy.

Instead, it should be possible to mask out bits in the floats mantisa to simulate something like casting to a 16bit float, then back - for eg.

Not being expert in float bit twiddling, I couldn't say if the resulting float would be valid (or need normalizing)

For speed, when exact behavior regarding rounding isn't important, what is a fast way to quantize floats, taking their magnitude into account?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T20:07:12+0000

The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.

If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value Scale in the code below is 2^b, then *x0 receives the high s-b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated. (double for double, float for float, and so on. If the compiler evaluates float expressions with double, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type [with Scale adjusted correspondingly], and then convert back.)

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;
    *x1 = x - *x0;
}

Categories

c - Fast float quantize, scaled by precision?

c - Fast float quantize, scaled by precision?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags