Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
967 views
in Technique[技术] by (71.8m points)

math - Multiword addition in C

I have a C program which uses GCC's __uint128_t which is great, but now my needs have grown beyond it.

What are my options for fast arithmetic with 196 or 256 bits?

The only operation I need is addition (and I don't need the carry bit, i.e., I will be working mod 2192 or 2256).

Speed is important, so I don't want to move to a general multi-precision if at all possible. (In fact my code does use multi-precision in some places, but this is in the critical loop and will run tens of billions of times. So far the multi-precision needs to run only tens of thousands of times.)

Maybe this is simple enough to code directly, or maybe I need to find some appropriate library.

What is your advice, Oh great Stack Overflow?

Clarification: GMP is too slow for my needs. Although I actually use multi-precision in my code it's not in the inner loop and runs less than 105 times. The hot loop runs more like 1012 times. When I changed my code (increasing a size parameter) so that the multi-precision part ran more often vs. the single-precision, I had a 100-fold slowdown (mostly due to memory management issues, I think, rather than extra μops). I'd like to get that down to a 4-fold slowdown or better.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

256-bit version

__uint128_t a[2], b[2], c[2];        // c = a + b
c[0] = a[0] + b[0];                  // add low part
c[1] = a[1] + b[1] + (c[0] < a[0]);  // add high part and carry

Edit: 192-bit version. This way you can eliminate the 128-bit comparison like what @harold's stated:

struct uint192_t {
    __uint128_t H;
    uint64_t L;
} a, b, c;  // c = a + b
c.L = a.L + b.L;
c.H = a.H + b.H + (c.L < a.L);

Alternatively you can use the integer overflow builtins or checked arithmetic builtins

bool carry = __builtin_uaddl_overflow(a.L, b.L, &c.L);
c.H = a.H + b.H + carry;

Demo on Godbolt


If you do a lot of additions in a loop you should consider using SIMD and/or running them in parallel with multithreading. For SIMD you may need change the layout of the type so that you can add all the low parts at once and all the high parts at once. Once possible solution is an array of struct of array as suggested here practical BigNum AVX/SSE possible?

SSE2:   llhhllhhllhhllhh
AVX2:   llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh

With AVX-512 you can add eight 64-bit values at once. So you can add eight 192-bit values in 3 instructions plus a few more for the carry. For more information read Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

With AVX-2 or AVX-512 you may also have very fast horizontal add so it may also worth a try for 256-bit even if you don't have parallel addition chains. But for 192-bit addition then 3 add/adc instructions would be much faster


There are also many libraries with a fixed-width integer type. For example Boost.Multiprecision

#include <boost/multiprecision/cpp_int.hpp>

using namespace boost::multiprecision;

uint256_t myUnsignedInt256 = 1;

Some other libraries:

  • ttmath: ttmath:UInt<3> (an int type with 3 limbs, which is 192 bits on 64-bit computers)
  • uint256_t

See also


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...