Project 2B
Lock Granularity and Performance
INTRODUCTION:
In the previous project the mutex and spin-lock proved to be bottlenecks,
preventing parallel access to the list.
In this project, you will do additional performance instrumentation to confirm this,
and extend your previous solution to deal with this problem.
This project can be broken up into a few major steps:
- Do performance instrumentation and measurement to confirm the cause of the problem.
- Implement a new option to divide a list into sublists and support synchronization
on sublists, thus allowing parallel access to the (original) list.
- Do new performance measurements to confirm that the problem has been solved.
RELATION TO READING AND LECTURES:
Partitioned lists and finer-granularity locking are discussed in sections 29.2-29.4.
PROJECT OBJECTIVES:
- primary: demonstrate the ability to recognize bottlenecks on large data structures
- primary: experience with partitioning a serialized resource to improve parallelism
- primary: experience with basic performance measurement and instrumentation
- primary: experience with execution profiling
- secondary: experience with finding, installing, and exploiting new development tools
DELIVERABLES:
A single tarball (.tar.gz) containing:
- the source for your lab2_list program (and any headers it needs)
- a Makefile with working default, clean, dist, tests, profile, and graphs targets
- your data-reduction scripts
- your profiling report (profile.out)
- the graphs lab2b_1.png through lab2b_5.png
- a README file with your identification information and answers to the analysis questions
PREPARATION:
To perform this assignment, you will need to research, choose, install and master a
multi-threaded execution profiling tool. Execution profiling is a combination of
compile-time and run-time tools that analyze a program's execution to determine
how much time is being spent in which routines.
There are a few well-known tools in this space:
- perf ... a state-of-the-art Linux tool that can interpret call-stacks
and exploit hardware performance counters (e.g. instructions, cache misses).
It is often installed by default on Linux development desktops (from the
linux-tools package). The only issue with this (powerful)
tool is that the annotate option (which shows the time spent on
each statement) breaks this down to the level of assembly language
instructions ... making the output a bit harder to read.
- gperftools ... a wonderful set of performance optimization tools from Google.
It includes a profiler that is quite similar to gprof (in that it samples real execution),
but is MT-safe.
- valgrind ... best known for its memory-leak detector, but it also has an interpreted
execution engine that can extract a great deal of information about where
CPU time is being spent, and can even estimate cache misses.
It does work for multi-threaded programs, but its interpreter does not provide
much parallelism. As such it is not useful for examining high contention situations.
- gprof(1) ... an older Linux tool, quite simple to use, but its
call-counting mechanism is not multi-thread-safe, and its execution sampling is
not multi-thread aware.
As such, it is not usable for analyzing the performance of multi-threaded applications.
This project is about scalable parallelism, which is only possible on a processor with many cores.
You can do most of your development and testing on any Linux system, but if your personal computer
has only a few cores (fewer than, say, 8-12), you will not be able to do real
multi-threaded performance testing on it.
Lab servers are available if you need access to a larger machine for your final testing,
graph production and performance analysis.
Different people may use different profiling tools, some of which may not be
available on our test machines. For this reason, we will not run your profile
rule during testing. Rather we will merely review your submitted profiling report
and confirm that it corresponds to your code.
PLATFORM REQUIREMENTS:
Some of these tests (particularly spin-locks with large numbers of threads)
will consume a great many cycles. Most desktop systems can probably
complete all of these tests in one or two minutes, but a weaker system
might require the better part of an hour for a single set of runs.
PROJECT DESCRIPTION:
Review your results from the previous lab (lab2_add-5.png and lab2_list-4.png)
for the cost (per synchronized operation) vs. the number of threads.
For Compare-and-Swap and Mutex,
we saw that adding threads took us from tens of ns per operation to small hundreds of ns per operation.
Looking at the analogous results in Part 2, we see the (un-adjusted for length) time per operation go
from a few microseconds, to tens of microseconds.
For the adds, moderate contention added ~100ns to each synchronization operation.
For the list operations, moderate contention added ~10us to each synchronization operation.
This represents a 100x difference in the per operation price of lock contention.
In a single-threaded implementation, time per operation is a very reasonable performance
metric. But for multi-threaded implementations we are also concerned with how well
we are taking advantage of the available parallelism. This is more clearly seen
in the aggregate throughput (total operations per second for all threads combined).
Run your list exerciser with:
- Mutex synchronized list operations, 1,000 iterations, 1,2,4,8,12,16,24 threads
- Spin-lock synchronized list operations, 1,000 iterations, 1,2,4,8,12,16,24 threads
Capture the output, and produce a plot of the total number of operations per second for each
synchronization method.
In the previous lab, we gave you all of the necessary data reduction scripts.
In this lab, you will have to create your own ... but you can use the scripts
provided in the previous lab as a starter.
- To get the throughput, divide one billion (number of nanoseconds in a second)
by the time per operation (in nanoseconds). You can include this arithmetic
in your instructions to gnuplot, where rather than saying
plot ... using($2):($7) ...
you can say
plot ... using($2):(1000000000/$7) ...
- Previously, we multiplied the number of operations by half of the mean list
length (to account for the longer searches in longer lists).
This time, we are focusing on synchronization costs ... and
our synchronization is implemented, not per-list-element, but per-operation.
Thus, in this analysis, we should use the raw time per operation,
and not correct our times for the list length.
Submit this graph as lab2b_1.png.
You do not need to go back to your lab2a_list program to generate this data, because
the lab2b_list program (even after adding all the new features) should still be able
to generate essentially the same results.
The most obvious difference, which we already knew:
- Spin-locks waste increasingly more CPU time as the probability of contention increases.
But these throughput lines show us something that was not as obvious in the cost per operation graphs:
- Add throughput quickly leveled off: once we had saturated the CPU,
the overhead of synchronization seemed to increase only very slowly.
- List operation throughput, by contrast, continues to drop, because the overhead of
synchronization increases with the number of threads - and much more steeply for spin-locks.
The obvious conclusions (from both the cost-per-operation graphs you produced previously, and the
throughput graphs you have just produced) are:
- The throughput of parallel synchronized linked list operations does not scale as well as
the throughput of parallel synchronized add operations.
- The reduction in throughput with increasing parallelism is due to an increasing time per operation.
Since the code inside the critical section does not change with the number of threads, it seems reasonable to assume that the added execution time is being spent getting the locks.
QUESTION 2.3.1 - CPU time in the basic list implementation:
Where do you believe most of the CPU time is spent in the 1- and 2-thread list tests?
Why do you believe these to be the most expensive parts of the code?
Where do you believe most of the CPU time is being spent in the high-thread spin-lock tests?
Where do you believe most of the CPU time is being spent in the high-thread mutex tests?
It should be clear why the spin-lock implementation performs so badly with a large number of threads.
But the mutex implementation should not have this problem.
You may have some theories about why these algorithms scale so poorly.
But theories are only theories, and the prime directive of performance analysis is that theoretical
conclusions must be confirmed by hard data.
Execution Profiling
Build your program with debug symbols, choose your execution profiling package, and install it.
Run the spin-lock list test (1,000 iterations, 12 threads) under the profiler, and analyze the results to determine where the CPU time is being spent.
The default output from google-pprof will show you which routine is consuming most of the CPU time.
- If you then re-run google-pprof with the --list option (specifying that routine),
it will give you a source-level breakdown of how much time is being spent on each instruction.
- the perf annotate option will give you a similar breakdown for every line of
source code.
You should get a very clear answer to the question of where the program is spending its time.
Update your Makefile to run this test and generate the results automatically (make profile),
include your profiling report (in a file named profile.out)
in your submitted tarball, and identify it in your README file.
QUESTION 2.3.2 - Execution Profiling:
Where (what lines of code) are consuming most of the CPU time when the spin-lock version
of the list exerciser is run with a large number of threads?
Why does this operation become so expensive with large numbers of threads?
Timing Mutex Waits
In the spin-lock case, profiling can tell us where we are spending most of our time.
In the mutex case, we are not spinning. A thread that cannot get the lock is blocked, and not consuming any CPU time.
Profiling only tells us what code we are executing.
It doesn't tell us anything about the time we spend blocked.
How could we confirm that, in the mutex case, most threads are spending most of their time waiting for a lock?
Update your mutex and spin-lock implementations to:
- Note the high-resolution
(clock_gettime CLOCK_MONOTONIC)
time before and after getting the lock,
compute the elapsed difference, and add that to a (per-thread) total (see the sketch below).
- When the program completes, add up the total lock acquisition time
(for all threads) and divide it by the number of lock operations to
compute an average wait-for-lock, and add this number,
as a new (8th) column, to the output statistics for the run.
If you are not locking, this number should always be zero.
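A minimal sketch of one way to do this timing follows; the helper names (timed_mutex_lock, elapsed_ns, wait_ns) are illustrative, not required by the assignment:

    #include <pthread.h>
    #include <time.h>

    /* Sketch only: time a single mutex acquisition with CLOCK_MONOTONIC and
     * add the elapsed nanoseconds to a per-thread total.  The names used
     * here are illustrative, not required. */
    static long long elapsed_ns(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) * 1000000000LL + (b->tv_nsec - a->tv_nsec);
    }

    void timed_mutex_lock(pthread_mutex_t *lock, long long *wait_ns)
    {
        struct timespec before, after;
        clock_gettime(CLOCK_MONOTONIC, &before);  /* just before requesting the lock */
        pthread_mutex_lock(lock);                 /* may block under contention */
        clock_gettime(CLOCK_MONOTONIC, &after);   /* the lock is now held */
        *wait_ns += elapsed_ns(&before, &after);  /* accumulate per-thread wait time */
    }

When the program completes, the per-thread totals can be summed and divided by the total
number of lock acquisitions to produce the new 8th column; the spin-lock acquisition loop
can be bracketed in the same way.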
Run the list mutex test again for 1,000 iterations and 1, 2, 4, 8, 16, 24 threads,
and plot the wait-for-lock time and the average time per operation against the
number of competing threads. Submit this graph as lab2b_2.png.
QUESTION 2.3.3 - Mutex Wait Time:
Look at the average time per operation (vs. # threads)
and the average wait-for-mutex time (vs. #threads).
- Why does the average lock-wait time rise so dramatically
with the number of contending threads?
- Why does the completion time per operation rise
(less dramatically) with the number of contending threads?
- How is it possible for the wait time per operation to go
up faster (or higher) than the completion time per operation?
Addressing the Underlying Problem
While the details of how contention degrades throughput differ between mutex and spin-lock synchronization,
in both cases the degradation is the result of increased contention for a single lock.
This is the fundamental problem with coarse-grained synchronization.
The classic solution to this problem is to partition the single resource
(in this case a linked list) into multiple independent resources, and divide the requests among those sub-resources.
Add a new --lists=# option to your lab2_list program:
- break the single (huge) sorted list into the specified number of sub-lists (each with its own list header and synchronization object).
- change your lab2_list program to select which sub-list a particular key should be in based on a simple hash of the key, modulo the number of lists.
- figure out how to (safely and correctly) obtain the length of the list, which now involves enumerating all of the sub-lists (see the sketch following this list).
- each thread:
- starts with a set of pre-allocated and initialized elements
(--iterations=#)
- inserts them all into the multi-list (the sub-list each key
goes into is determined by a hash of the key)
- gets the list length
- looks up and deletes each of the keys it inserted
- exits to re-join the parent thread
- Include the number of lists as the fourth number
(always previously 1) in the output statistics report.
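The sketch below illustrates one possible approach to the sub-list selection and the length computation. It assumes the SortedList_t / SortedList_length interface from the previous lab, a character-sum hash, and one mutex per sub-list; all of these are illustrative choices, not requirements:

    #include <pthread.h>

    /* Sketch only: one header and one lock per sub-list.  SortedList_t and
     * SortedList_length are assumed to be the interface from the previous lab;
     * the struct, hash, and locking scheme here are illustrative choices. */
    struct sub_list {
        SortedList_t    header;   /* this sub-list's head */
        pthread_mutex_t lock;     /* this sub-list's synchronization object */
    };

    static struct sub_list *lists;   /* array of --lists=# sub-lists */
    static int nlists;               /* value of --lists=# */

    /* Simple (assumed) hash: sum of the key's characters, modulo the number of lists. */
    static int which_list(const char *key)
    {
        unsigned int h = 0;
        while (*key)
            h += (unsigned char)*key++;
        return (int)(h % nlists);
    }

    /* One safe (if conservative) way to get the total length: take every sub-list
     * lock (always in index order, so all threads agree and cannot deadlock),
     * sum the sub-list lengths, then release the locks. */
    static int total_length(void)
    {
        int i, len = 0;
        for (i = 0; i < nlists; i++)
            pthread_mutex_lock(&lists[i].lock);
        for (i = 0; i < nlists; i++)
            len += SortedList_length(&lists[i].header);
        for (i = 0; i < nlists; i++)
            pthread_mutex_unlock(&lists[i].lock);
        return len;
    }

Inserts, lookups, and deletes would then take only lists[which_list(key)].lock, which is what
restores the parallelism that a single global lock prevented.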
The supported command line options and expected output are illustrated below:
% ./lab2_list --threads=10 --iterations=1000 --lists=5 --yield=id --sync=m
list-id-m,10,1000,5,30000,85811144,2860,20580
%
The updated set of output fields for this project should be:
- the name of the test, which is of the form:
list-yieldopts-syncopts, where
- yieldopts = {none, i, d, l, id, il, dl, idl}
- syncopts = {none, s, m}
- the number of threads (from --threads=)
- the number of iterations (from --iterations=)
- the number of lists (from --lists=)
- the total number of operations performed:
threads x iterations x 3 (insert + lookup + delete)
- the total run time (in nanoseconds) for all threads
- the average time per operation (in nanoseconds)
- the average wait-for-lock time (in nanoseconds)
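For illustration only, the statistics line shown in the sample output above could be produced by something like the following (the function and variable names are hypothetical):

    #include <stdio.h>

    /* Sketch only: print the 8-field statistics line, e.g.
     *   list-id-m,10,1000,5,30000,85811144,2860,20580 */
    void print_stats(const char *yieldopts, const char *syncopts,
                     int nthreads, int niterations, int nlists,
                     long long ops, long long run_time_ns, long long avg_wait_ns)
    {
        printf("list-%s-%s,%d,%d,%d,%lld,%lld,%lld,%lld\n",
               yieldopts, syncopts, nthreads, niterations, nlists,
               ops, run_time_ns, run_time_ns / ops, avg_wait_ns);
    }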
Confirm the correctness of your new implementation:
- Run your program with --yield=id,
4 lists, 1,4,8,12,16 threads, and 1, 2, 4, 8, 16 iterations
(and no synchronization) to see how many iterations it takes
to reliably fail (and make sure your Makefile
expects some of these tests to fail).
- Run your program with --yield=id, 4 lists,
1,4,8,12,16 threads, and 10, 20, 40, 80 iterations,
--sync=s and --sync=m to confirm that
updates are now properly protected
(i.e., all runs succeeded).
- Plot these results (as you did last week) and submit the
result as lab2b_3.png.
Now that we believe the partitioned lists implementation works, we can measure its performance:
- Rerun both synchronized versions, without yields, for 1000 iterations,
1,2,4,8,12 threads, and 1,4,8,16 lists.
- For each synchronization mechanism, graph the aggregated throughput
(total operations per second, as you did for lab2b_1.png)
vs. the number of threads, with a separate curve for each number of lists
(1,4,8,16).
Call these graphs lab2b_4.png (sync=m) and
lab2b_5.png (sync=s).
QUESTION 2.3.4 - Performance of Partitioned Lists
- Explain the change in performance of the synchronized
methods as a function of the number of lists.
- Should the throughput continue increasing as the number
of lists is further increased? If not, explain why not.
- It seems reasonable to suggest the throughput of an N-way
partitioned list should be equivalent to the throughput of
a single list with fewer (1/N) threads.
Does this appear to be true in the above curves?
If not, explain why not.
SUMMARY OF EXIT CODES:
- 0: successful run
- 1: invalid argument or system call error
- 2: other failures ... e.g. list operation failures due to conflicting updates
SUBMISSION:
Your README file must include lines of the form:
NAME: your name
EMAIL: your email
ID: your student ID
And, if slip days are allowed on this project and you want to use some,
this too must be included in the README file.
If, for instance, you wanted to use two slip days, you would add the following
line:
SLIPDAYS: 2
Your name, student ID, and email address should also appear as comments at the top
of your Makefile and each source file.
Your tarball should have a name of the form lab2b-studentID.tar.gz.
You can sanity check your submission with this
test script.
Projects that do not pass the test script will not be accepted!
We will test it on a departmental Linux server.
You would be well advised to test your submission on that platform before submitting it.
GRADING:
Points for this project will be awarded:
| value | feature |
| Packaging and build (10% total) |
| 2% | un-tars expected contents |
| 3% | clean build of correct programs w/default action (no warnings) |
| 3% | Makefile has working clean, dist, tests, profile and graphs targets |
| 2% | reasonableness of README contents |
| Code Review (20% total) |
| 4% | overall readability and reasonableness |
| 4% | multi-list implementation |
| 4% | thread correctly sums up the length across sub-lists |
| 4% | mutex use on multi-lists |
| 4% | spin-lock use on multi-lists |
| Results (40% total) ... produces correct output for |
| 5% | lists |
| 5% | correct mutex |
| 5% | correct spin |
| 5% | reasonable time per operation reporting |
| 5% | reasonable wait for mutex reporting |
| 10% | graphs (showed what we asked for) |
| 5% | profiling report (clearly shows where CPU time went) |
| Analysis (30% total) ... reasonably explained all results in README |
| 2% | general clarity of thought and understanding |
| 7% | 2.3.1 where the CPU time goes |
| 7% | 2.3.2 profiling |
| 7% | 2.3.3 wait time |
| 7% | 2.3.4 list partitioning |
Note:
if your program does not accept the correct options or produce the
correct output, you are likely to receive a zero for the results portion
of your grade. Look carefully at the sample commands and output.
If you have questions, ask.