Optimizing Matrix Transposition
In this part of the assignment you and your partner will write a matrix-transpose function that causes as few cache misses as possible.
The Matrix Transposition
Let \( A \) denote a matrix, and \( A_{ij} \) denote the element at the \( i \)th row and \( j \)th column. The transpose of \( A \), denoted \( A^T \), is a matrix such that \( A_{ij}=A^T_{ji}. \)
To help you get started, we have given you an example transpose
function in trans.c that computes the transpose of \( N \times M
\) matrix \( A \) and stores the results in \( M \times N \)
matrix \( B \):
char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[N][M], int B[M][N])
This example transpose function is correct, but it is inefficient because the access pattern results in a relatively large number of cache misses.
Your job in this part of the assignment is to write a similar
function, transpose_submit, that minimizes the number of cache
misses across different sized matrices:
char transpose_submit_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[N][M], int B[M][N]);
Do not change the description string Transpose submission in
your transpose_submit function. The autograder searches for this
string to determine which transpose function to evaluate for credit.
Coding Rules
- All your work will be done in the file
trans.c. - Include both partners' names and login IDs in the header comment
for
trans.c. - Your code in
trans.cmust compile without warnings onwilkesto receive credit. -
You are allowed to define at most twelve local variables of type
intper transpose function.Our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays. We will hand-check your code for violation of this rule!. -
You are not allowed to sidestep the previous rule by using any variables of type
longor by using any bit tricks to store more than one value to a single variable. - Your transpose function may not use recursion.
-
If you choose to use helper functions, you may not have more than twelve local variables on the stack at a time between your helper functions and your top-level transpose function.
For example, if your transpose declares eight variables, and you then call a function that uses four variables which calls another function that uses two variables, you will have fourteen variables on the stack, and you will be in violation of the rule.
-
Your transpose function may not modify array
A. You may, however, do whatever you want with the contents of arrayB. - You are not allowed to define any arrays in your code.
- You are not allowed to use any variant of
malloc.
Tips
Here is a list of suggestions for working on this part of the assignment.
-
The
test-transprogram, described below, saves the trace for functionin filei trace.f.i Because Valgrind introduces many stack accesses that have nothing to do with your code, we have filtered out all stack accesses from the trace. This is the reason that we have banned local arrays and placed limits on the number of local variables.These trace files are invaluable debugging tools that can help you understand exactly where the hits and misses for each transpose function are coming from. To debug a particular function, simply run its trace through the reference simulator with the verbose option:
./csim-ref -v -s 5 -E 1 -b 5 -t trace.f0 S 68312c,1 miss L 683140,8 miss L 683124,4 hit L 683120,4 hit L 603124,4 miss eviction S 6431a0,4 miss ... -
Since your transpose function is being evaluated on a direct-mapped cache, conflict misses are a potential problem. Think about the potential for conflict misses in your code, especially along the diagonal. Try to think of access patterns that will decrease the number of these conflict misses.
- Blocking is a useful technique for reducing cache misses. For more information, see the document “CS:APP2e Web Aside MEM:BLOCKING: Using Blocking to Increase Temporal Locality” on the textbook's website.
Testing with the Autograder
We have provided you with an autograding program, test-trans.c,
that tests the correctness and performance of each of the transpose
functions that you have registered with the autograder.
Note that the autograder must be run on a Linux machine that has the
valgrind program installed. We will test wilkes, and we
suggest that you also work there.
You can register up to 100 versions of the transpose function in your
trans.c file. Each transpose version has the form
/* Header comment */
char trans_simple_desc[] = "A simple transpose";
void trans_simple(int M, int N, int A[N][M], int B[M][N])
{
/* your transpose code here */
}
where simple should replaced with an appropriate descriptive word,
and the quoted string should briefly describe the function. When
you are testing, you can create several different versions with
different names and compare their performance; after you find the
best function for a particular setting, you can have your
transpose_submit function just call that function and remove the
others.
You can register a particular transpose function with the autograder by making a call of the form
registerTransFunction(trans_simple, trans_simple_desc);
in the registerFunctions routine in trans.c, where (in this
example) trans_simple is the name of your function. At run time,
the autograder will evaluate each registered transpose function and
print its results.
Of course, one of the registered functions must be the
transpose_submit function that you are submitting for credit:
registerTransFunction(transpose_submit, transpose_submit_desc);
See the default trans.c function for an example of how this works.
Running the Autograder
The autograder is run with two arguments that specify the matrix size. It
then uses valgrind to generate a trace of each registered transpose
function, and evaluates each trace by running the reference simulator on
a cache with the parameters \( s=5, E=1, \mathrm{and~} b=5. \)
For example, to test your registered transpose functions on a \( 32
\times 32 \) matrix, rebuild test-trans with make and then run
it with the appropriate values for \( M \) and \( N \):
./test-trans -M 32 -N 32
Step 1: Evaluating registered transpose funcs for correctness:
func 0 (Transpose submission): correctness: 1
func 1 (Simple row-wise scan transpose): correctness: 1
func 2 (column-wise scan transpose): correctness: 1
func 3 (using a zig-zag access pattern): correctness: 1
Step 2: Generating memory traces for registered transpose funcs.
Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945
Summary for official submission (func 0): correctness=1 misses=287
In this example, we have registered four different transpose functions
in trans.c. The test-trans program tests each of the registered
functions, displays the results for each, and extracts the results for
the official submission.
Evaluation
For this part of the assignment, we will evaluate the correctness and performance of your transpose_submit function on three differently sized output matrices:
Optimization for Each Test Case
Your code must be correct to receive any performance points for a particular size.
However,
- Your code only needs to be correct for the three cases we're testing it on.
- You may optimize your code specially for each case.
In particular, it is perfectly fine for your function to explicitly check the input sizes and use separate codepaths that you've optimized for each case.
(When logged in, completion status appears here.)
Note issues; details in comments.
\( 32 \times 32 \qquad M=32, N=32 \)
\( 61 \times 67 \qquad M=61, N=67 \)
\( 64 \times 64 \qquad M=64, N=64 \quad \textrm{(extra credit)} \)
For each matrix size, the performance of your
transpose_submitfunction is evaluated by using Valgrind to extract the address trace for your function, and then using the reference simulator to replay this trace on a cache with parameters (\( s=5, E=1, b=5 \)).Your performance score for each matrix size scales linearly with the number of misses, \( m \), up to some threshold:
\( 32 \times 32 \): 8 points if \( m < 300 \); 0 points if \( m > 600 \)' linearly from 8 to 0 if \( 300 \leq m \leq 600 \).
\( 61 \times 67 \): 10 points if \( m < 2000 \); 0 points if \( m > 3000 \); linearly in between (extra credit).
\( 64 \times 64 \): 8 points if \( m < 1300 \); 0 points if \( m > 2000 \); linearly in between.