In this lab session, you will learn the black magic of optimization for instruction-level parallelism using the Intel Architecture Code Analyzer.
You may work in teams of two.
You are required to do this assignment on the Stampede Intel Sandy Bridge nodes and also on the MIC.
For this assignment you will need a compiler which supports Nehalem SIMD extensions and optimization options.
Next, you will learn how to build C++ programs which use intrinsic functions. Download the tarball for this lab:
wget http://progforperf.github.com/LS4.tar
tar xvf LS4.tar
To simplify the build process for this assignment we have provided a Makefile. The Makefile depends on the environment variables CXX and CXXFLAGS, which specify the C++ compiler and compilation flags. On Linux, you use special compiler options to tell the compiler that your program uses SSE intrinsic functions; otherwise the compiler will not recognize them. In gcc the compiler option to allow SSE4.1 intrinsics, which are required for this lab, is -msse4.1 (this option also implies -msse, -msse2, -msse3, and -mmmx, and thus enables MMX and SSE1-SSE4.1 intrinsics). However, we recommend instead using the -march=corei7 option, which tells the compiler to enable all intrinsics supported on Nehalem CPUs, commonly known as Core-i7. The recommended shell command to set CXXFLAGS for gcc and clang is
export CXXFLAGS="-O2 -g -march=corei7 -mtune=corei7"
The -O2 option tells the compiler to optimize the code for speed, but not to use speculative optimization techniques. Such techniques, enabled by the -O3 option, are more "aggressive": though designed to increase performance, they can actually decrease it as well. The -g option tells the compiler to include debug information in the compiled program. The -march=corei7 option enables all intrinsics supported on Nehalem, and -mtune=corei7 asks the compiler to optimize the program specifically for Nehalem (e.g. by rescheduling instructions so that they run faster on Nehalem). Some other meaningful values for the -march and -mtune options appear below. (But note that these are not suitable for use on Jinx.)
- core2: for Intel Core 2 and Xeon processors of the same generation (Harpertown, Clovertown). For 45nm Core 2 processors (and Harpertown Xeons) you should also specify -msse4.1 because they additionally support the SSE4.1 instruction set.
- corei7-avx: for Intel Sandy Bridge processors (aka second-generation Core-i7).
- core-avx-i: for Intel Ivy Bridge processors (aka third-generation Core-i7).
- bdver2: for AMD Trinity APUs (Piledriver core).
- bdver1: for AMD Bulldozer processors (AMD FX processors, Opteron 3200, 4200, 6200 series).
- amdfam10: for AMD K10 (most pre-Bulldozer CPUs from AMD, including Opteron 4100 and 6100 series and Phenom processors).
- atom: for Intel Atom processors.
- btver1: for AMD Bobcat processors (E-350 and alike).
- native: for the CPU you compile on.

When debugging your program you might also want to add the -DDEBUG flag. With -DDEBUG the program will run fewer iterations, and will report the results sooner. Do not use it for your final timing runs.
export CXXFLAGS="-O2 -g -march=corei7 -mtune=corei7 -DDEBUG"
For this assignment you must use gcc. To compile with gcc, set CXX as
export CXX=g++
After you set the CXX and CXXFLAGS variables as described above, you may build the program by typing
make
In the tarball we provided a program which computes pairwise products of 4x4 single-precision floating point matrices in arrays, producing an array of product matrices.
The compiled program will be named matmult.
For this part you have to build and run the matmult program. Answer the following questions:
Which compiler and which compilation flags did you use?
Does the program run faster if we use higher versions of SSE instructions?
What fraction of peak single-core FLOPS does the best version achieve? Remember that the processor is able to issue both an SSE FP multiplication and an SSE FP addition instruction every cycle.
wget http://progforperf.github.com/iaca-lin32.zip
unzip iaca-lin32.zip
In this part you will learn how to use Intel Architecture Code Analyzer (IACA) to improve the performance of your program.
IACA reports how the instructions in your program map to execution ports inside the CPU, and gives you some implicit hints about optimization. To analyze your code with IACA you have to insert special markers into it. The markers are defined in the header file iacaMarks.h, which is already included in matmult.cpp. Insert the IACA_START marker before the main loop in the function matrix4x4_mul_sse_scalar_load, and the IACA_END marker immediately after the loop. Compile the program. IACA markers are not valid x86 instructions: if you compile and run a program with such markers, the program may crash. Run IACA with
iaca -64 -arch NHM -o matmul-sse-scalar-load.txt matmult.o
The -64 parameter tells IACA to analyze a 64-bit program, -arch NHM means that IACA should produce the analysis for the Nehalem architecture, and -o matmul-sse-scalar-load.txt writes the output into the matmul-sse-scalar-load.txt file.
Now remove the IACA markers from the matrix4x4_mul_sse_scalar_load function, and add them before and after the main loop in the matrix4x4_mul_sse function. Run IACA with the parameters
iaca -64 -arch NHM -o matmul-sse.txt matmult.o
Inspect the matmul-sse-scalar-load.txt and matmul-sse.txt files. They list the disassembly of the code between IACA_START and IACA_END, the number of macro-operations (fused micro-operations) for each instruction, and the execution ports they bind to. Notice which CPU resource is the bottleneck.
Now remove all IACA markers from matmult.cpp, and consider the function matrix4x4_mul_sse_optimized. Initially it is the same as the matrix4x4_mul_sse function. Now replace the computation of c2 in matrix4x4_mul_sse_optimized with the corresponding code from matrix4x4_mul_sse_scalar_load. Compile and run.
Answer the following questions:
Which CPU resource is the bottleneck in the matrix4x4_mul_sse function?
Which CPU resource is the bottleneck in the matrix4x4_mul_sse_scalar_load function?
Which function achieves higher performance, matrix4x4_mul_sse or matrix4x4_mul_sse_scalar_load?
Which function achieves higher performance, matrix4x4_mul_sse or matrix4x4_mul_sse_optimized?
Explain why the changes you made to matrix4x4_mul_sse_optimized are good for performance.