To start off, the main difference between your two tests is definitely in bounds check elimination; however, the way this influences the machine code is far from what the na?ve expectation would suggest.
My conjecture:
The bounds check figures more strongly as a loop exit point than as additional code which introduces overhead.
The loop exit point prevents the following optimization which I have culled from the emitted machine code:
- the loop is unrolled (this is true in all cases);
- additionaly, the fetching from the array stage is done first for all unrolled steps, then the xoring into accumulator is done for all the steps.
If the loop can break out at any step, this staging would result in work performed for loop steps which were never actually taken.
Consider this slight modification of your code:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.N)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
public class Measure {
public static final int N = 1024;
private final int[] table = new int[N];
@Setup public void setUp() {
final Random random = new Random();
for (int i = 0; i < table.length; ++i) {
final int x = random.nextInt();
table[i] = x == 0? 1 : x;
}
}
@GenerateMicroBenchmark public int normalIndex() {
int result = 0;
final int[] table = this.table;
int x = 0;
for (int i = 0; i <= table.length - 1; ++i) {
x += i;
final int j = x & (table.length - 1);
final int entry = table[i];
result ^= entry + j;
if (entry == 0) break;
}
return result;
}
@GenerateMicroBenchmark public int maskedIndex() {
int result = 0;
final int[] table = this.table;
int x = 0;
for (int i = 0; i <= table.length - 1; ++i) {
x += i;
final int j = x & (table.length - 1);
final int entry = table[j];
result ^= i + entry;
if (entry == 0) break;
}
return result;
}
}
There is just one difference: I have added the check
if (entry == 0) break;
to give the loop a way to exit prematurely on any step. (I also introduced a guard to ensure no array entries are actually 0.)
On my machine, this is the result:
Benchmark Mode Samples Mean Mean error Units
o.s.Measure.maskedIndex avgt 5 1.378 0.229 ns/op
o.s.Measure.normalIndex avgt 5 0.924 0.092 ns/op
the "normal index" variant is substantially faster, as generally expected.
However, let us remove the additional check:
// if (entry == 0) break;
Now my results are these:
Benchmark Mode Samples Mean Mean error Units
o.s.Measure.maskedIndex avgt 5 1.130 0.065 ns/op
o.s.Measure.normalIndex avgt 5 1.229 0.053 ns/op
"Masked index" responded predictably (reduced overhead), but "normal index" is suddenly much worse. This is apparently due to a bad fit between the additional optimization step and my specific CPU model.
My point:
The performance model at such a detailed level is very unstable and, as witnessed on my CPU, even erratic.