Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
638 views
in Technique[技术] by (71.8m points)

optimization - SIMD instructions lowering CPU frequency

I read this article. It talked about why AVX-512 instruction:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core, or maybe the rest of the CPU to run slower because of how much power they use.

I think on Agner's blog also mentioned something similar (but I can't find the exact post).

I wonder what other instructions supported by Skylake have the similar effect that they will lower the power to maximize the throughput later? All the v prefixed instructions (such as vmovapd, vmulpd, vaddpd, vsubpd, vfmadd213pd)?

I am trying to compile a list of instructions to avoid when compiling my C++ application for Xeon Skylake.

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The frequency impact depends on the width of the instruction and the instruction used.

There are three frequency levels, so-called licenses, from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the chip says "3.5 GHz turbo", they are referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo5, originally associated with AVX and AVX2 instructions1. L2 is a lower speed than L1, sometimes called "AVX-512 turbo".

The exact speeds for each license also depend on the number of active cores. For up to date tables, you can usually consult WikiChip. For example, the table for the Xeon Gold 5120 is here:

Xeon Gold 5120 Frequencies

The Normal, AVX2 and AVX512 rows correspond to the L0, L1 and L2 licenses respectively. Note that the relative slowdown for L1 and L2 licenses generally gets worse as the number of cores increase: for 1 or 2 active cores the L1 and L2 speeds are 97% and 91% of L0, but for 13 or 14 cores they are 85% and 62% respectively. This varies by chip, but the general trend is usually the same.

Those preliminaries out of the way, let's get to what I think you are asking: which instructions cause which licenses to be activated?

Here's a table, showing the implied license for instructions based on their width and their categorization as light or heavy:

   Width    Light   Heavy  
 --------- ------- ------- 
  Scalar    L0      N/A
  128-bit   L0      L0     
  256-bit   L0      L1*    
  512-bit   L1      L2*

*soft transition (see below)

So we immediately see that all scalar (non-SIMD) instructions and all 128-bit wide instructions2 always run at full speed in the L0 license.

256-bit instructions will run in L0 or L1, depending on whether they are light or heavy, and 512-bit instructions will run in L1 or L2 on the same basis.

So what is this light and heavy thing?

Light vs Heavy

It's easiest to start by explaining heavy instructions.

Heavy instructions are all SIMD instructions that need to run on the FP/FMA unit. Basically that's the majority of the FP instructions (those usually ending in ps or pd, like addpd) as well as integer multiplication instructions which largely start with vpmul or vpmad since SIMD integer multiplication actually runs on the SIMD unit, as well as vplzcnt(q|d) which apparently also runs on the FMA unit.

Given that, light instructions are everything else. In particular, integer arithmetic other than multiplication, logical instructions, shuffles/blends (including FP) and SIMD load and store are light.

Transitions

The L1 and L2 entries in the Heavy column are marked with an asterisk, like L1*. That's because these instructions cause a soft transition when they occur. The other L1 entry (for 512-bit light instructions) causes a hard transition. Here we'll discuss the two transition types.

Hard Transition

A hard transition occurs immediately as soon as any instruction with the given license executes4. The CPU stops, takes some halt cycles and enters the new mode.

Soft Transition

Unlike hard transitions, a soft transition doesn't occur immediately as soon as any instruction is executed. Rather, the instructions initially execute with a reduced throughput (as slow as 1/4 their normal rate), without changing the frequency. If the CPU decides that "enough" heavy instructions are executing per unit time, and a specific threshold is reached, a transition to the higher-numbered license occurs.

That is, the CPU understands that if only a few heavy instructions arrive, or even if many arrive but they aren't dense when considering other non-heavy instructions, it may not be worth reducing the frequency.

Guidelines

Given the above, we can establish some reasonable guidelines. You never have to be scared of 128-bit instructions, since they never cause license related3 downclocking.

Furthermore, you never have to be worried about light 256-bit wide instructions either, since they also don't cause downclocking. If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this would apply to you. Indeed, compilers already liberally insert 256-bit instructions when you use the appropriate -march option, especially for data movement and auto-vectorized loops.

Using heavy AVX/AVX2 instructions and light AVX-512 instructions is trickier, because you will run in the L1 licenses. If only a small part of your process (say 10%) can take advantage, it probably isn't worth slowing down the rest of your application. The penalties associated with L1 are generally moderate - but check the details for your chip.

Using heavy AVX-512 instructions is even trickier, because the L2 license comes with serious frequency penalties on most chips. On the other hand, it is important to note that only FP and integer multiply instructions fall into the heavy category, so as a practical matter a lot of integer 512-bit wide use will only incur the L1 license.


1 Although, as we'll see, this a bit of a misnomer because AVX-512 instructions can set the speed to this license, and some AVX/2 instructions don't.

2 128-bit wide means using xmm registers, regardless of what instruction set they were introduced in - mainstream AVX-512 contains 128-bit variants for most/all new instructions.

3 Note the weasel clause license related - you may certainly suffer other causes of downclocking, such as thermal, power or current limits, and it is possible that 128-bit instructions could trigger this, but I think it is fairly unlikely on a desktop or server system (low power, small form factor devices are another matter).

4 Evidently, we are talking only about transitions to a higher-level license, e.g., from L0 to L1 when a hard-transition L1 instruction executes. If you are already in L1 or L2 nothing happens - there is no transition if you are already in the same level and you don't transition to lower-numbered levels based on any specific instruction but rather running for a certain time without any instructions of the higher-numbered level.

5 Out of the two AVX2 turbo is more common, which I never really understood because 256-bit instructions are as much associated with AVX as compared to AVX2, and most of the heavy instructions which actually trigger AVX turbo (L1 license) are actually FP instructions in AVX, not AVX2. The only exception is AVX2 integer multiplies.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...