I'm using VTune to look at the hotspots in my code. I'm down into the vector instructions and confused by a few things. I understand that VTune isn't a cycle-accurate simulator, but why is it that I see things like the following:
...
vmovaps %zmm8, %k0, %zmm0             324.718 ms
vmovaps %zmm0, %k0, %zmm8              84.007 ms
vpslld  $0x1f, %zmm2, %k0, %zmm1       96.931 ms
vpsrld  $0x01, %zmm2, %k0, %zmm2      134.087 ms
vpord   %zmm2, %zmm1, %k0, %zmm1      143.781 ms
vmovaps %zmm9, %k0, %zmm2             245.558 ms
vmovaps %zmm21, %k0, %zmm9             75.929 ms
...
As far as I can tell there are no data dependencies on prior instructions.
Q1: why are the vmovaps times all over the map?
Q2: why is vpslld about 30% different from vpsrld?
Q3: why is there no indication of a pipeline stall on the vpord (due to the prior vpslld/vpsrld instructions)?
Q4: though I can't show it here, all my "CPU time" histogram bars in the source and assembly windows are red indicating "poor". How is a single vmovaps deemed to be "poor" (vs. Idle, Ok, Ideal, Over)?
My application is compiled with icc -g -debug extended -debug inline-debug-info -debug expr-source-pos -std=c9x -O3 -Wall -openmp -offload ...
Extra credit: It appears that the compiler (icc) is reluctant to rearrange vector instructions to avoid data-dependency pipeline stalls, especially if the instructions come from different source code expressions (i.e. different source lines). Is there some information on how aggressive I should expect that compiler optimization to be?
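To make the dependency in Q3 concrete, here is a scalar sketch of what the vpslld/vpsrld/vpord sequence appears to compute per 32-bit lane, assuming I am reading the AT&T operand order correctly (destination last). It looks like a rotate right by one bit. The function name rotate_right_1 is made up for illustration and is not from my actual source.

```c
#include <stdint.h>

/* Hypothetical per-lane view of the three-instruction sequence.
 * The two shifts read the same input and do not depend on each other;
 * only the final OR needs both of their results. */
static uint32_t rotate_right_1(uint32_t x)
{
    uint32_t hi = x << 31;  /* vpslld $0x1f: lowest bit moves to the top */
    uint32_t lo = x >> 1;   /* vpsrld $0x01: logical shift right by one */
    return hi | lo;         /* vpord: combines both shift results */
}
```

So the only true dependency in that group is the vpord on the two shifts before it, which is why I expected any stall to show up there.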