Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share


performance - What do multiple values or ranges mean as the latency for a single instruction?

I have a question about instruction latency on https://uops.info/.

For some instructions like PCMPEQB (XMM, M128), the latency in the table entry for Skylake is listed as [1;≤8].

I know a little about latency, but everything I'd seen before was a single number, e.g. 1 or 2 or 3. What does [1;≤8] mean? Does it mean the latency depends on memory and is somewhere between 1 and 8?

If so, when is it 1, when is it 3, etc.?

For example, what is the latency for this :

pcmpeqb xmm0, xword [.my_aligned_data]

....
....

align 16
.my_aligned_data db 5,6,7,2,5,6,7,2,5,6,7,2,5,6,7,2

What is the exact latency value for this pcmpeqb xmm0, xword [.my_aligned_data]?

or for example,

PMOVMSKB (R32, XMM)

the latency for this instruction is listed as (≤3). What does that mean? That the latency is between 1 and 3? If so, this instruction only involves registers, so when is it 1 vs. some higher number?


1 Reply


Why two numbers, separated by a semicolon?

The instruction has 2 inputs and 2 uops (unfused domain), so both inputs aren't needed at the same time. e.g. the memory address is needed for the load, but the vector register input isn't needed until the load is ready.

That's why there are 2 separate fields in the latency value.

Click on the latency number link in https://uops.info/ for the breakdown of which operand to which result has which latency.

https://www.uops.info/html-lat/SKL/PCMPEQB_XMM_M128-Measurements.html breaks it down for this specific instruction on Skylake. It has 2 inputs and one output, written to the same operand as one of the inputs because this is the non-VEX version. (Fun fact: that lets it keep a uop micro-fused even with an indexed addressing mode on HSW and later, unlike the VEX version.)

Operand 1 (r/w): is the XMM Register
Operand 2 (r): Memory

  • Latency operand 1 → 1: 1
  • Latency operand 2 → 1 (address, base register): ≤8
  • Latency operand 2 → 1 (memory): ≤5

And below that there are the specific instruction sequences that were used to test this instruction.
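As a rough illustration (hypothetical helper names, not uops.info code), you can read those per-operand entries as upper bounds on the dependency-chain contribution through each input; the critical path through the instruction is whichever input's latency dominates:

```python
# Illustrative model of the per-operand latency entries for
# PCMPEQB (XMM, M128) on Skylake, as reported by uops.info.
# "<= n" entries are upper bounds; modeled here as their max value.
latencies = {
    "op1_xmm": 1,   # XMM register input -> result: 1
    "op2_addr": 8,  # address/base register -> result: <= 8
    "op2_mem": 5,   # memory contents -> result: <= 5
}

def worst_case_chain(inputs):
    """Upper bound on the critical path through any one input."""
    return max(latencies[op] for op in inputs)

# A dep chain through only the XMM register sees latency 1;
# a chain through the base register could see up to 8 cycles.
print(worst_case_chain(["op1_xmm"]))                          # 1
print(worst_case_chain(["op1_xmm", "op2_addr", "op2_mem"]))   # 8
```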

This detailed breakdown is where uops.info testing really shines compared to any other testing results or published numbers, especially for multi-uop instructions like mul or shr reg, cl. e.g. for shifts, the latency from reg or shift count to output is only 1 cycle; the extra uops are just for FLAGS merging.


Variable latency: why ≤8

Store-forwarding latency is variable on SnB family, and address-generation / L1d Load-use latency can be as well (Is there a penalty when base+offset is in a different page than the base?). Notice this has a memory source operand. But that's not why the latency is listed as ≤ n.

The ≤n latency values are an upper limit, I think. It does not mean that the latency from that operand could be as low as 1.

I think they only give an upper bound in cases where they weren't able to test accurately enough to pin down a definite lower bound.

Instructions like PMOVMSKB (R32, XMM) that produce their output in a different domain than their input are very hard to pin down. You need to use other instructions to feed the output back into the input to create a loop-carried dependency chain, and it's hard to design experiments to pin the blame on one part of the chain vs. another.

But unlike InstLatx64, the people behind https://uops.info/ didn't just give up in those cases. Their tests are vastly better than nothing!

e.g. a store/reload has some latency, but how do you choose how much of it to blame on the store vs. the load? (A sensible choice would be to list the load's latency as the L1d load-use latency, but unfortunately that's not what Agner Fog chose. His load vs. store latencies are totally arbitrary, like divided in half or something, leading to insanely low load latencies that aren't the load-use latency :/)

There are different ways of getting data from integer regs back into XMM regs as an input dependency for pmovmskb: ALU via movd or pinsrb/w/d/q, or a load. Or on AVX512 CPUs, via kmov and then using a masked instruction. None of these are simple and you can't assume that load-use latency for a SIMD load will be the same as an integer load. (We know store-forwarding latency is higher.)

As @BeeOnRope comments, uops.info typically times a round trip, and the displayed latency is the cycle count of the entire round trip, minus any known padding instructions, minus 1. For example, if you time a GP -> SIMD -> GP round trip at 4 cycles (no padding), both of those instructions will be shown as ≤3.

When getting an upper bound for each one, you presumably can assume that any instruction has at least 1 cycle latency. e.g. for a pmovmskb -> movd chain, you can assume that movd has at least 1 cycle of latency, so the pmovmskb latency is at most the round-trip latency minus 1. But really it's probably less.
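That subtraction is simple enough to sketch (illustrative arithmetic only; upper_bound_latency is a made-up name, not part of any uops.info tooling):

```python
def upper_bound_latency(round_trip_cycles, padding_cycles=0,
                        other_insns_in_chain=1):
    """Upper bound on one instruction's latency from a round-trip
    measurement: every other instruction in the loop-carried chain
    is assumed to take at least 1 cycle each."""
    return round_trip_cycles - padding_cycles - other_insns_in_chain

# GP -> SIMD -> GP round trip measured at 4 cycles, no padding:
# each of the two instructions is displayed as <= 3.
print(upper_bound_latency(4))  # 3
```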


https://www.uops.info/html-lat/SKL/DIVPD_XMM_M128-Measurements.html, for example, shows different "Chain latencies" for different experiments. For one of the 1 -> 1 tests that runs divpd with ORPD and ANDPD creating a dep chain with the same dividend repeatedly, uops.info lists the known latency of those extra instructions in the dep chain as Chain latency: ≥10. (It could theoretically be higher if resource conflicts or some other effect made it not always produce a result exactly 10 cycles after the divpd output was ready. The point of these experiments is to catch weird effects that we might not have expected.)

So given the "Core cycles: 44.0" minus the chain latency of at least 10, we can say that the divpd latency is at most 34, with the rest of the dep chain accounting for the other 10 (but possibly more).
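The arithmetic from that experiment, as a sketch (these are the numbers quoted above; nothing here is newly measured):

```python
# Bounding divpd latency from the uops.info SKL experiment:
core_cycles = 44.0       # measured cycles for the whole loop-carried chain
chain_latency_min = 10   # known minimum latency of the ORPD/ANDPD chain

# The chain eats at least 10 of the 44 cycles, so divpd itself
# accounts for at most the remainder.
divpd_upper_bound = core_cycles - chain_latency_min
print(divpd_upper_bound)  # 34.0
```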

(34.0 seems high; maybe I'm misinterpreting something. The inputs do have lots of significant mantissa bits, vs. experiment 2 which I think is doing 1.0 / 1.0 with nothing else in the loop, measuring 6 cycle latency from XMM -> XMM as a best case.)

Note that I'm just talking about the xmm -> xmm case here, not their more complex tests that feed back the XMM output as a dependency for the address or for memory contents.

