In this lab session, you will learn the black magic of optimization for instruction-level parallelism using the Intel Architecture Code Analyzer.
You may work in teams of two.
You are required to do this assignment on the Stampede Intel Sandy Bridge nodes and also on the MIC.
For this assignment you will need a compiler which supports Nehalem SIMD extensions and optimization options.
Next, you will learn how to build C++ programs which use intrinsic functions. Download the tarball for this lab:
wget http://progforperf.github.com/LS4.tar
tar xvf LS4.tar
To simplify the build process for this assignment we have provided a Makefile. The Makefile depends on the environment variables CXX and CXXFLAGS, which specify the C++ compiler and compilation flags. On Linux, you use special compiler options to tell the compiler that your program uses SSE intrinsic functions; otherwise the compiler will not recognize them. In gcc the compiler option to allow SSE4.1 intrinsics, which are required for this lab, is -msse4.1 (this option also implies -msse, -msse2, -msse3, and -mmmx, and thus enables MMX and SSE1-SSE4.1 intrinsics). However, we recommend instead using the -march=corei7 option, which tells the compiler to enable all intrinsics supported on Nehalem CPUs, commonly known as Core-i7. The recommended shell command to set CXXFLAGS for gcc and clang is
export CXXFLAGS="-O2 -g -march=corei7 -mtune=corei7"
The -O2 option tells the compiler to optimize the code for speed, but not to use speculative optimization techniques. Such techniques, enabled by the -O3 option, are more "aggressive": though designed to increase performance, they can actually decrease it as well. The -g option tells the compiler to include debug information in the compiled program. The -march=corei7 option enables all intrinsics supported on Nehalem, and -mtune=corei7 asks the compiler to optimize the program specifically for Nehalem (e.g. by rescheduling instructions so that they run faster on Nehalem). Some other meaningful values for the -march and -mtune options appear below. (But note that these are not suitable for use on Jinx.)
- core2: for Intel Core 2 and Xeon processors of the same generation (Harpertown, Clovertown). For 45nm Core 2 processors (and Harpertown Xeons) you should also specify -msse4.1 because they additionally support the SSE4.1 instruction set.
- corei7-avx: for Intel Sandy Bridge processors (aka second-generation Core-i7).
- core-avx-i: for Intel Ivy Bridge processors (aka third-generation Core-i7).
- bdver2: for AMD Trinity APUs (Piledriver core).
- bdver1: for AMD Bulldozer processors (AMD FX processors, Opteron 3200, 4200, 6200 series).
- amdfam10: for AMD K10 (most pre-Bulldozer CPUs from AMD, including Opteron 4100 and 6100 series and Phenom processors).
- atom: for Intel Atom processors.
- btver1: for AMD Bobcat processors (E-350 and alike).
- native: for the CPU you compile on.

When debugging your program you might also want to add the -DDEBUG flag. With -DDEBUG the program will run fewer iterations, and will report the results sooner. Do not use it for your final timing runs.
export CXXFLAGS="-O2 -g -march=corei7 -mtune=corei7 -DDEBUG"
For this assignment you must use gcc. To compile with gcc, set CXX as
export CXX=g++
After you set the CXX and CXXFLAGS variables as described above, you may build the program by typing
make
In the tarball we provided a program which computes pairwise products of 4x4 single-precision floating point matrices in arrays, producing an array of product matrices.
The compiled program will be named matmult.
For this part you have to build and run the matmult program. Answer the following questions:
Which compiler and which compilation flags did you use?
Does the program run faster if we use higher versions of SSE instructions?
What fraction of peak single-core FLOPS does the best version achieve? Remember that the processor is able to issue both an SSE FP multiplication and an SSE FP addition instruction every cycle.
wget http://progforperf.github.com/iaca-lin32.zip
unzip iaca-lin32.zip
In this part you will learn how to use Intel Architecture Code Analyzer (IACA) to improve the performance of your program.
IACA reports how the instructions in your program map to execution ports inside the CPU, and gives you some implicit hints about optimization. To analyze your code with IACA you have to insert special markers into it. The markers are defined in the header file iacaMarks.h, which is already included in matmult.cpp. Insert the IACA_START marker before the main loop in the function matrix4x4_mul_sse_scalar_load, and the IACA_END marker immediately after the loop. Compile the program. IACA markers are not valid x86 instructions: if you compile and run a program with such markers, the program may crash. Run IACA with
iaca -64 -arch NHM -o matmul-sse-scalar-load.txt matmult.o
The -64 parameter tells IACA to analyze a 64-bit program, -arch NHM means that IACA should produce the analysis for the Nehalem architecture, and -o matmul-sse-scalar-load.txt writes the output into the matmul-sse-scalar-load.txt file.
Now remove the IACA markers from the matrix4x4_mul_sse_scalar_load function, and add them before and after the main loop in the matrix4x4_mul_sse function. Run IACA with the parameters
iaca -64 -arch NHM -o matmul-sse.txt matmult.o
Inspect the matmul-sse-scalar-load.txt and matmul-sse.txt files. They list the disassembly of the code between IACA_START and IACA_END, the number of macro-operations (fused micro-operations) for each instruction, and the execution ports they bind to. Notice which CPU resource is the bottleneck.
Now remove all IACA markers from matmult.cpp, and consider the function matrix4x4_mul_sse_optimized. Initially it is the same as the matrix4x4_mul_sse function. Now replace the computation of c2 in matrix4x4_mul_sse_optimized with the corresponding code from matrix4x4_mul_sse_scalar_load. Compile and run.
Answer the following questions:
Which CPU resource is the bottleneck in the matrix4x4_mul_sse function?
Which CPU resource is the bottleneck in the matrix4x4_mul_sse_scalar_load function?
Which function achieves higher performance, matrix4x4_mul_sse or matrix4x4_mul_sse_scalar_load?
Which function achieves higher performance, matrix4x4_mul_sse or matrix4x4_mul_sse_optimized?
Explain why the changes you made to matrix4x4_mul_sse_optimized are good for performance.