Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
984 views
in Technique[技术] by (71.8m points)

c++ - Atomically clearing lowest non-zero bit of an unsigned integer

Question:
I'm looking for the best way to clear the lowest non-zero bit of a unsigned atomic like std::atomic_uint64_t in a threadsafe fashion without using an extra mutex or the like. In addition, I also need to know, which bit got cleared.

Example: Lets say, if the current value stored is 0b0110 I want to know that the lowest non-zero bit is bit 1 (0-indexed) and set the variable to 0b0100.

The best version I came up with is this:

#include <atomic>
#include <cstdint>

inline uint64_t with_lowest_non_zero_cleared(std::uint64_t v){
    return v-1 & v;
}
inline uint64_t only_keep_lowest_non_zero(std::uint64_t v){
    return v & ~with_lowest_non_zero_cleared(v);
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected,with_lowest_non_zero_cleared(expected), std::memory_order_seq_cst, std::memory_order_relaxed))
    {;}
    return only_keep_lowest_non_zero(expected);
}

is there a better solution?

Notes:

  • It doesn't have to be the lowest non-zero bit - I'd also be happy with a solution that clears the highest bit or sets the highest/lowest "zero bit" if that makes a difference
  • I'd very much prefer a standard c++ (17) version, but would acccept an answer that uses intrinsics for clang and msvc
  • It should be portable (compiling with clang or msvc for x64 and AArch64), but I'm most interested in the performance on recent intel and AMD x64 architectures when compiled with clang.
  • EDIT: The update has to be atomic and global progress must be guaranteed, but just as with solution above, it doesn't have to be a wait free algorithm (of course I'd be very happy if you can show me one).
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

There is no direct hardware support for atomic clear-lowest-bit on x86. BMI1 blsr is only available in memory-source form, not memory-destination form; lock blsr [shared_var] does not exist. (Compilers know how to optimize (var-1) & (var) into blsr for local vars when you compile with -march=haswell or otherwise enable code-gen that assumes BMI1 support.) So even if you can assume BMI1 support1, it doesn't let you do anything fundamentally different.

The best you can do on x86 is the lock cmpxchg retry loop you propose in the question. This is a better option than finding the right bit in an old version of the variable and then using lock btr, especially if it would be a correctness problem to clear the wrong bit if a lower bit was set between the bit-scan and the lock btr. And you'd still need a retry loop in case another thread already cleared the bit you wanted.

CAS retry loops are not bad in practice: retry is quite rare

(Unless your program has very high contention for the shared variable, which would be problematic for performance even with lock add where there's no trying, just hardware arbitration for access to cache lines. If that's the case, you should probably redesign your lockless algorithm and/or consider some kind of OS-assisted sleep/wake instead of having a lot of cores spending a lot of CPU time hammering on the same cache line. Lockless is great in the low-contention case.)

The window for the CPU to lose the cache line between the load to get expected and running lock cmpxchg with the result of a couple instructions on that value is tiny. Usually it will succeed the first time through, because the cache line will still be present in L1d cache when the cmpxchg runs. When the cache line arrives, it will (hopefully) already be in MESI Exclusive state, if the CPU saw far enough ahead to do an RFO for it.

You can instrument your cmpxchg retry loops to see how much contention you actually get in your real program. (e.g. by incrementing a local inside the loop and using an atomic relaxed += into a shared counter once you succeed, or with thread-local counters.)

Remember that your real code (hopefully) does plenty of work between atomic operations on this bitmask, so it's very different from a microbenchmark where all the threads spend all their time hammering on that cache line.

EDIT: The update has to be atomic and global progress must be guaranteed, but just as with solution above, it doesn't have to be a wait free algorithm (of course I'd be very happy if you can show me one).

A CAS retry loop (even when compiled on a LL/SC machine, see below) is lock-free in the technical sense: you only have to retry if some other thread made progress.

CAS leaves the cache line unmodified if it fails. On x86 it dirties the cache line (MESI M state), but x86 cmpxchg doesn't detect ABA, it only compares, so one other thread that loaded the same expected will succeed. On LL/SC machines, hopefully an SC failure on one core doesn't cause surious SC failures on other cores, otherwise it could livelock. That's something you can assume CPU architects thought of.

It's not wait-free because a single thread can in theory have to retry an unbounded number of times.


Your code compiles with gcc8.1 -O3 -march=haswell to this asm (from the Godbolt compiler explorer)

# gcc's code-gen for x86 with BMI1 looks optimal to me.  No wasted instructions!
# presumably you'll get something similar when inlining.
pop_lowest_non_zero(std::atomic<unsigned long>&):
    mov     rax, QWORD PTR [rdi]
.L2:
    blsr    rdx, rax                      # Bit Lowest Set Reset
    lock cmpxchg    QWORD PTR [rdi], rdx
    jne     .L2                           # fall through on success: cmpxchg sets ZF on equal like regular cmp
    blsi    rax, rax                      # Bit Lowest Set Isolate
    ret

Without BMI1, blsr and blsi become two instructions each. This is close to irrelevant for performance given the cost of a locked instruction.

clang and MSVC unfortunately are slightly more clunky, with a taken branch on the no-contention fast path. (And clang bloats the function by peeling the first iteration. IDK if this helps with branch prediction or something, or if it's purely a missed optimization. We can get clang to generate the fast path with no taken branches using an unlikely() macro. How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?).


Footnote 1:

Unless you're building binaries for a known set of machines, you can't assume BMI1 is available anyway. So the compiler will need to do something like lea rdx, [rax-1] / and rdx, rax instead of blsr rdx, rax. This makes a trivial difference for this function; the majority of the cost is the atomic operation even in the uncontended case, even for the hot-in-L1d cache no contention case, looking at the out-of-order execution throughput impact on surrounding code. (e.g. 10 uops for lock cmpxchg on Skylake (http://agner.org/optimize/) vs. saving 1 uop with blsr instead of 2 other instructions. Assuming the front-end is the bottleneck, rather than memory or something else. Being a full memory barrier has an impact on loads/stores in surrounding code, too, but fortunately not on out-of-order execution of independent ALU instructions.)


Non-x86: LL/SC machines

Most non-x86 machines do all their atomic operations with load-linked / store-conditional retry loops. It's a bit unfortunate that C++11 doesn't allow you to create custom LL/SC operations (e.g. with (x-1) & x inside an LL/SC instead of just the add that you'd get from using fetch_add), but CAS machines (like x86) can't give you the ABA detection that LL/SC provides. So it's not clear how you'd design a C++ class that could compile efficiently on x86 but also directly to a LL/SC retry loop on ARM and other LL/SC ISAs. (See this discussion.)

So you just have to write code using compare_exchange_weak if C++ doesn't provide the operation you want (e.g. fetch_or or fetch_and).

What you get in practice from current compilers is a compare_exchange_weak implemented with LL/SC, and your arithmetic operation separate from that. The asm actually does the relaxed load before the exclusive-load-acquire (ldaxr), instead of just basing the computation on the ldaxr result. And there's extra branching to check that expected from the last attempt matches the new load result before attempting the store.

The LL/SC window is maybe shorter than with 2 dependent ALU instructions between the load and store, in case that matters. The CPU has the desired value ready ahead of time, not dependent on the load result. (It depends on the previous load result.) Clang's code-gen puts the sub/and inside the loop, but dependent on the previous iteration's load, so with out of order execution they can still be ready ahead of time.

But if that was really the most efficient way to do things, compilers should be using it for fetch_add as well. They don't because it probably isn't. LL/SC retries are rare, just like CAS retries on x86. I don't know if it could make a different for very-high-contention situations. Maybe, but compilers don't slow down the fast path to optimize for that when compiling fetch_add.

I renamed your functions and re-formatted the while() for readability, because it was already too long for one line without wrapping it with unlikely().

This version compiles to maybe slightly nicer asm than your original, with clang. I also fixed your function names so it actually compiles at all; there's a mismatch in your question. I picked totally different names that are similar to x86's BMI instruction names because they succinctly describe the operation.

#include <atomic>
#include <cstdint>

#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...