CS 105

Optimizing Matrix Transposition

In this part of the assignment you and your partner will write a matrix-transpose function that causes as few cache misses as possible.

The Matrix Transposition

Let \( A \) denote a matrix, and \( A_{ij} \) denote the element at the \( i \)th row and \( j \)th column. The transpose of \( A \), denoted \( A^T \), is a matrix such that \( A_{ij}=A^T_{ji}. \)

To help you get started, we have given you an example transpose function in trans.c that computes the transpose of \( N \times M \) matrix \( A \) and stores the results in \( M \times N \) matrix \( B \):

char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[N][M], int B[M][N])

This example transpose function is correct, but it is inefficient because the access pattern results in a relatively large number of cache misses.

Your job in this part of the assignment is to write a similar function, transpose_submit, that minimizes the number of cache misses across different sized matrices:

char transpose_submit_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[N][M], int B[M][N]);

Do not change the description string Transpose submission in your transpose_submit function. The autograder searches for this string to determine which transpose function to evaluate for credit.

Coding Rules

  • All your work will be done in the file trans.c.
  • Include both partners' names and login IDs in the header comment for trans.c.
  • Your code in trans.c must compile without warnings on wilkes to receive credit.
  • You are allowed to define at most twelve local variables of type int per transpose function.

    Our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays. We will hand-check your code for violation of this rule!.

  • You are not allowed to sidestep the previous rule by using any variables of type long or by using any bit tricks to store more than one value to a single variable.

  • Your transpose function may not use recursion.
  • If you choose to use helper functions, you may not have more than twelve local variables on the stack at a time between your helper functions and your top-level transpose function.

    For example, if your transpose declares eight variables, and you then call a function that uses four variables which calls another function that uses two variables, you will have fourteen variables on the stack, and you will be in violation of the rule.

  • Your transpose function may not modify array A. You may, however, do whatever you want with the contents of array B.

  • You are not allowed to define any arrays in your code.
  • You are not allowed to use any variant of malloc.

Tips

Here is a list of suggestions for working on this part of the assignment.

  • The test-trans program, described below, saves the trace for function i in file trace.fi.

    Because Valgrind introduces many stack accesses that have nothing to do with your code, we have filtered out all stack accesses from the trace. This is the reason that we have banned local arrays and placed limits on the number of local variables.

    These trace files are invaluable debugging tools that can help you understand exactly where the hits and misses for each transpose function are coming from. To debug a particular function, simply run its trace through the reference simulator with the verbose option:

    ./csim-ref -v -s 5 -E 1 -b 5 -t trace.f0
    S 68312c,1 miss
    L 683140,8 miss
    L 683124,4 hit
    L 683120,4 hit
    L 603124,4 miss eviction
    S 6431a0,4 miss
    ...
    
  • Since your transpose function is being evaluated on a direct-mapped cache, conflict misses are a potential problem. Think about the potential for conflict misses in your code, especially along the diagonal. Try to think of access patterns that will decrease the number of these conflict misses.

  • Blocking is a useful technique for reducing cache misses. For more information, see the document “CS:APP2e Web Aside MEM:BLOCKING: Using Blocking to Increase Temporal Locality” on the textbook's website.

Testing with the Autograder

We have provided you with an autograding program, test-trans.c, that tests the correctness and performance of each of the transpose functions that you have registered with the autograder.

Note that the autograder must be run on a Linux machine that has the valgrind program installed. We will test wilkes, and we suggest that you also work there.

You can register up to 100 versions of the transpose function in your trans.c file. Each transpose version has the form

/* Header comment */
char trans_simple_desc[] = "A simple transpose";
void trans_simple(int M, int N, int A[N][M], int B[M][N])
{
    /* your transpose code here */
}

where simple should replaced with an appropriate descriptive word, and the quoted string should briefly describe the function. When you are testing, you can create several different versions with different names and compare their performance; after you find the best function for a particular setting, you can have your transpose_submit function just call that function and remove the others.

You can register a particular transpose function with the autograder by making a call of the form

registerTransFunction(trans_simple, trans_simple_desc);

in the registerFunctions routine in trans.c, where (in this example) trans_simple is the name of your function. At run time, the autograder will evaluate each registered transpose function and print its results.

Of course, one of the registered functions must be the transpose_submit function that you are submitting for credit:

registerTransFunction(transpose_submit, transpose_submit_desc);

See the default trans.c function for an example of how this works.

Running the Autograder

The autograder is run with two arguments that specify the matrix size. It then uses valgrind to generate a trace of each registered transpose function, and evaluates each trace by running the reference simulator on a cache with the parameters \( s=5, E=1, \mathrm{and~} b=5. \)

For example, to test your registered transpose functions on a \( 32 \times 32 \) matrix, rebuild test-trans with make and then run it with the appropriate values for \( M \) and \( N \):

./test-trans -M 32 -N 32
Step 1: Evaluating registered transpose funcs for correctness:
func 0 (Transpose submission): correctness: 1
func 1 (Simple row-wise scan transpose): correctness: 1
func 2 (column-wise scan transpose): correctness: 1
func 3 (using a zig-zag access pattern): correctness: 1

Step 2: Generating memory traces for registered transpose funcs.

Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945

Summary for official submission (func 0): correctness=1 misses=287

In this example, we have registered four different transpose functions in trans.c. The test-trans program tests each of the registered functions, displays the results for each, and extracts the results for the official submission.

Evaluation

For this part of the assignment, we will evaluate the correctness and performance of your transpose_submit function on three differently sized output matrices:

Note issues; details in comments.

  • \( 32 \times 32 \qquad M=32, N=32 \)

  • \( 61 \times 67 \qquad M=61, N=67 \)

  • \( 64 \times 64 \qquad M=64, N=64 \quad \textrm{(extra credit)} \)

For each matrix size, the performance of your transpose_submit function is evaluated by using Valgrind to extract the address trace for your function, and then using the reference simulator to replay this trace on a cache with parameters (\( s=5, E=1, b=5 \)).

Your performance score for each matrix size scales linearly with the number of misses, \( m \), up to some threshold:

  • \( 32 \times 32 \): 8 points if \( m < 300 \); 0 points if \( m > 600 \)' linearly from 8 to 0 if \( 300 \leq m \leq 600 \).

  • \( 61 \times 67 \): 10 points if \( m < 2000 \); 0 points if \( m > 3000 \); linearly in between (extra credit).

  • \( 64 \times 64 \): 8 points if \( m < 1300 \); 0 points if \( m > 2000 \); linearly in between.

Optimization for Each Test Case

Your code must be correct to receive any performance points for a particular size.

However,

  • Your code only needs to be correct for the three cases we're testing it on.
  • You may optimize your code specially for each case.

In particular, it is perfectly fine for your function to explicitly check the input sizes and use separate codepaths that you've optimized for each case.

To Complete This Part of the Assignment

You'll know you're done with this part of the assignment when you've done all of the following:

(When logged in, completion status appears here.)