PA 3

Due: (strict deadline) Thursday, March 7th 2013 by 9:59 pm.

COMPETITION: The Group with the best performing code will be awarded a Grand Prize!!

Problem

For this assignment, you will optimize a routine to multiply two double-precision square matrices. As discussed in class, the naive implementation is short, sweet, and horrifyingly slow. A naive blocked code is only marginally better. You will need to use what you have learned about tuning to get your code to run as fast as possible on a single core on one node of the TACC Stampede supercomputer.

We provide:

A trivial unoptimized implementation and simple blocked implementation
A timing harness and tester
A version of the interface that calls the ATLAS BLAS

Implementation

Your function must have the following signature:

void square_dgemm(unsigned M, const double* A, const double* B,
                  double* C);

The three arrays should be interpreted as matrices in column-major order with leading dimension M. The operation implemented will actually be a multiply-add:

C := C + A*B

The necessary files are in PA3.tar

Included are:

Makefile: a sample Makefile with some basic rules
Makefile.in: default settings for optimization flags and the like
matmul.c: the driver program
basic_dgemm.c: a very simple square_dgemm implementation
blocked_dgemm.c: a slightly more complex square_dgemm implementation
blas_dgemm.c: a wrapper that lets the C driver program call the dgemm routine in a tuned BLAS implementation
f2c_dgemm.c: a wrapper that illustrates how to call the reference FORTRAN dgemm routine from C.
fdgemm.f: the reference FORTRAN dgemm called by f2c_dgemm.c
tplot.sh: a sample script that uses gnuplot to plot timing results. If the raw output is in (for example) timing-basic.out, run "./tplot.sh basic" to generate timing-basic.pdf.
make_sge.sh: a helper shell script that generates SGE batch system submission scripts for matmul timing runs.

We will be testing on the compute node on the Stampede. Each node has 2 octal-core chips, but you will only be using a single core for this assignment.

Instructions on Stempede

Use the follwing command to include MKL in the PATH: export MKLROOT=/opt/apps/intel/13/composer_xe_2013.1.117/mkl

Submission

Your group should submit your dgemm.c, your Makefile (which should include any overrides to the default Makefile.in so that we can see compiler optimizations) and a write-up. Your write-up should contain:

a description of optimizations used or attempted
the results of those optimizations
your explanations for any odd behavior (e.g. performance dips)
All the relevant plots.

To document the effect of your optimizations, include a graph comparing your code with basic_dgemm.c. Your explanations should rely heavily on your knowledge of the memory hierarchy (benchmark graphs help).

General strategy

Roughly speaking, a fast matrix multiplication routine will likely have two ingredients: a fast "kernel" capable of multiplying small matrices quickly, and then routines that use the kernel in a cache-friendly way to multiply larger matrices. One way to approach the assignment is top-down: sketch out the kernel first, and make sure you understand the global structure of the calls you want to set up, then start tuning parameters. Another way is bottom-up: build a fast kernel, play with parameters until you're happy with it; then build a matrix multiply for small-ish matrices (things that fit in L1 cache) out of the kernel, and play with parameters until happy; and then repeat for the L2 (and maybe L3) cache. I strongly recommend the second path, if only for the sake of sanity while debugging! It's easier to focus on getting a small thing right first than to try to get it all right on the first try.

Where arrays live in C

There are different types of places where variables can live in memory:

Automatic variables (the type you declare at the start of a function) live on the stack.
Global variables and static variables (defined with the static keyword in a function) live in the global space.
Constants often live in their own constant space.
Dynamically allocated data lives on the heap.

In a lot of day-to-day stuff, we use the stack for almost everything. But the stack is a finite resource! If you start allocating large arrays on the stack, you will very quickly have a stack overflow (and a program crash). This is probably not what you want, so I suggest using the following strategy if you want a little side buffer for copy optimization:

void my_function(...) 
{ 
    /* static means that this goes in the global segment 
     * (only one copy -- not thread safe!), and the alignment 
     * attribute is helpful if you're going to use my_space 
     * with SSE stuff.
     */ 
    static double my_space[BLOCK_SIZE*BLOCK_SIZE] 
        __attribute__((aligned(16))); 
    ... 
}

You can also dynamically allocate space on the heap -- but you won't want to do that anywhere near an inner loop!

Timing woes

Timing on SMP Linux systems is a pain for two reasons:

The real time clock measures total time for a run, including time given to other processes.
The per-CPU clocks may differ, so that if your process gets scheduled on multiple processors, your timing gets goofy.

The code I have given you should usually work, but a more reliable way may be to define CLOCK as CLOCK_PROCESS_CPUTIME_ID in timing.c and to use taskset to ensure that your process is pinned to a single physical processor: i.e. use taskset -c 0 ./matmul (or change $1 >> timing.raw to taskset -c 0 $1 >> timing.raw in make_sge.sh.

Competition notes:

We will care about matrix multiply in the range of problems starting with those that do not fit in L2 to those which do not fit in L3. The exact sizes will not be given, but do expect that some tests will be on odd sized matrixes. We will also check very small matrix sizes as well. You may need two versions of the code one for smaller matrix sizes and one of bigger matrix sizes. All matrices will be square. We will test on TACC Stampede.
You are free to use any vector intrinsics available. You are free to prefetch.
We will compile the code with gcc -O3.

Further Resources

How To Write Fast Numerical Code: A Small Introduction (Note: how to write C code for modern compilers and memory hierarchies, so that it runs fast.) {Must read pages: 38-45}
Goto, K., and van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12. (Note: explains the design decisions for the GotoBLAS dgemm implementation, which also apply to your code.)
You can find out more about the processor by running Todd Allen's cpuid utility and by running cat /proc/cpuinfo. Note that you should do this using the batch queueing system, since the front-end node has a different processor than the workers.
What Every Programmer Should Know About Memory by Ulrich Drepper is a 114 page description of what Drepper considers to be memory basics. Good information on everything from the physics of memory hardware up through the logical organization
The optimizations used in PhiPAC and ATLAS may be interesting. Note: You cannot use PHiPAC or ATLAS to generate your matrix multiplication kernels. You can write your own code generator, however. You might want to skim the tech report on PhiPAC. The section on C coding guidelines will be particularly relevant.
There is a classic paper on the idea of cache blocking in the context of matrix multiply by Monica Lam, et al.
Several folks have tried to automatically get cache locality by basing their matrix storage organization around space-filling curves. See the paper by Chatterjee and the classic paper by Gustavson, "Recursion leads to automatic variable blocking for dense linear algebra". I don't necessarily recommend these optimizations on a time budget, but they make interesting reading.

Programming Assignment 3