
VTUNE and VPU hotspot analysis


I'm using VTUNE to look at the hotspots in my code.  I'm down into the vector instructions and confused by a few things.  I understand that VTUNE isn't a cycle-accurate simulator, but why is it that I see things like the following:

   ...
   vmovaps %zmm8, %k0, %zmm0                324.718 ms
   vmovaps %zmm0, %k0, %zmm8                 84.007 ms
   vpslld $0x1f, %zmm2, %k0, %zmm1           96.931 ms
   vpsrld $0x01, %zmm2, %k0, %zmm2          134.087 ms
   vpord %zmm2, %zmm1, %k0, %zmm1           143.781 ms
   vmovaps %zmm9, %k0, %zmm2                245.558 ms
   vmovaps %zmm21, %k0, %zmm9                75.929 ms
   ...

As far as I can tell there are no data dependencies on prior instructions.

Q1: why are the vmovaps times all over the map?

Q2: why is vpslld 30% different than vpsrld?

Q3: why is there no indication of a pipeline stall on the vpord (due to the prior vpslld/vpsrld instructions)?

Q4: though I can't show it here, all my "CPU time" histogram bars in the source and assembly windows are red indicating "poor".  How is a single vmovaps deemed to be "poor" (vs. Idle, Ok, Ideal, Over)?

My application is compiled with icc -g -debug extended -debug inline-debug-info -debug expr-source-pos -std=c9x -O3 -Wall -openmp -offload ...

Extra credit: It appears that the compiler (icc) is reluctant to rearrange vector instructions to avoid data dependency pipeline stalls, especially if the instructions come from different source code expressions (i.e. different source lines).  Is there some information as to how aggressive I should expect that compiler optimization to be?
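
To make the dependency in Q3 concrete: the vpslld $0x1f / vpsrld $0x01 / vpord sequence in the listing amounts, per 32-bit lane, to a rotate right by one bit. Below is a minimal scalar sketch of that pattern in C. The names rotr1 and rotr1_lanes are hypothetical and used only for illustration, and this is not the actual source that produced the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar equivalent of one 32-bit lane of the
   vpslld / vpsrld / vpord sequence shown in the listing. */
static uint32_t rotr1(uint32_t x)
{
    uint32_t hi = x << 31;  /* like vpslld $0x1f: bit 0 moves up to bit 31 */
    uint32_t lo = x >> 1;   /* like vpsrld $0x01: remaining bits move down by one */
    return hi | lo;         /* like vpord: cannot start until both shifts are done */
}

/* Apply the rotate to every element; a vectorizing compiler may turn
   a loop like this into the shift/shift/or sequence, 16 lanes at a time. */
static void rotr1_lanes(uint32_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = rotr1(v[i]);
}
```

The two shifts are independent of each other, but the OR consumes both results, which is the dependency I would expect to show up on the vpord.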

