This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.
The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.
Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors.
The parallel instructions (like x86's popcnt, on CPUs where it's supported) will almost certainly be fastest.
Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed).
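If you'd rather not care which of these your CPU has, compilers can pick for you. A minimal sketch, assuming a GCC or Clang toolchain (the function name is mine; `__builtin_popcount` is the real built-in, which emits the hardware instruction when the target supports it and a software fallback otherwise):

```c
#include <stdint.h>

/* GCC/Clang built-in: compiles to the hardware popcnt instruction
 * when targeting a CPU that has it (e.g. -mpopcnt / -march=native),
 * otherwise to a software sequence. C++20 has std::popcount in <bit>. */
unsigned count_bits_builtin(uint32_t x)
{
    return (unsigned)__builtin_popcount(x);
}
```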
A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these instructions in a tight loop.
However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory.
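The table-lookup method can be sketched like this (a minimal illustration; `byte_popcount`, `init_table` and `popcount_lut` are names of my choosing, and the lazy one-time init is not thread-safe):

```c
#include <stdint.h>

/* 256-entry table: the popcount of every possible byte value.
 * Built once; after that each lookup is a single load (if it's in cache). */
static uint8_t byte_popcount[256];

static void init_table(void)
{
    for (int v = 0; v < 256; v++) {
        int c = 0;
        for (int b = v; b != 0; b >>= 1)
            c += b & 1;
        byte_popcount[v] = (uint8_t)c;
    }
}

/* Count set bits in a 32-bit word, one byte lookup at a time. */
unsigned popcount_lut(uint32_t x)
{
    static int ready = 0;           /* lazy init; not thread-safe */
    if (!ready) { init_table(); ready = 1; }
    return byte_popcount[x & 0xFF]
         + byte_popcount[(x >> 8)  & 0xFF]
         + byte_popcount[(x >> 16) & 0xFF]
         + byte_popcount[(x >> 24) & 0xFF];
}
```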
If you know that your bytes will be mostly 0's or mostly 1's then there are very efficient algorithms for these scenarios.
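One well-known algorithm for the mostly-0 case is Brian Kernighan's trick: `x & (x - 1)` clears the lowest set bit, so the loop body runs once per set bit rather than once per bit position. A minimal sketch (function name mine):

```c
#include <stdint.h>

/* Each iteration clears the lowest set bit, so the loop runs exactly
 * as many times as there are 1-bits -- fast when bits are sparse. */
unsigned popcount_sparse(uint32_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```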
I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'.
I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (eg using uint32_t for C++ and >>> in Java):
int numberOfSetBits(int i)
{
    // Java: use >>> instead of >>
    // C or C++: use uint32_t
    i = i - ((i >> 1) & 0x55555555);                  // per-2-bit counts
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);   // per-4-bit counts
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;  // per-byte counts, summed
}
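Following the adjustment suggested in the comments, a direct C translation using uint32_t might look like this (the function name is mine):

```c
#include <stdint.h>

uint32_t numberOfSetBits32(uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);                 /* per-2-bit counts */
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u); /* per-4-bit counts */
    i = (i + (i >> 4)) & 0x0F0F0F0Fu;                 /* per-byte counts  */
    return (i * 0x01010101u) >> 24;                   /* sum the 4 bytes  */
}
```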
This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it.
This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction.
(eg x86-64 code that has to run on any CPU, not just Nehalem or later.)
However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel.
(The 4 bits index a 16 entry table held in a vector register.)
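The core idea is easiest to show in scalar form (names mine): a 16-entry table indexed by each 4-bit nibble. In the real vector version this table lives in a vector register, and a variable-shuffle instruction like PSHUFB performs all the lookups in parallel, one per byte lane:

```c
#include <stdint.h>

/* Popcount of each possible 4-bit value. In the SIMD version this
 * table sits in a vector register and PSHUFB does 16 of these
 * lookups at once, one per byte. */
static const uint8_t nibble_popcount[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar illustration: look up each nibble of x and sum the results. */
unsigned popcount_nibble_lut(uint32_t x)
{
    unsigned count = 0;
    for (int shift = 0; shift < 32; shift += 4)
        count += nibble_popcount[(x >> shift) & 0xF];
    return count;
}
```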
On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right.
Otherwise SSE can come out significantly ahead.
Newer compiler versions are aware of the popcnt false dependency problem on Intel.
References:
https://graphics.stanford.edu/~seander/bithacks.html
https://en.wikipedia.org/wiki/Hamming_weight
http://gurmeet.net/puzzles/fast-bit-counting-routines/
http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)