Compare and Swap
Suppose that we wanted to compute x = 2*(x+1); atomically. We could try writing it on an x86 machine as
lock incl x ; x += 1 (atomic increment)
lock sall x ; x <<= 1 (atomic left shift by one => mult by 2)
Similarly, let's consider this code for adding a list node onto the front of a linked list:
// declare malloc because reasons
void *malloc(size_t size);
struct list_node {
int data;
struct list_node *next;
};
struct list_node *head;
void add_to_list(int value) {
struct list_node *new_node = malloc(sizeof(struct list_node));
new_node->data = value;
new_node->next = head;
head = new_node;
}
The assembly code for this function might look something like
add_to_list:
pushl %esi ; save esi because we're gonna use it
subl $8, %esp ; argument for malloc goes on the stack
movl $8, (%esp) ; size of struct list_node
calll malloc ; allocate memory for new node, address in eax
movl 16(%esp), %esi ; load value into esi
movl %esi, (%eax) ; new_node->data = value
movl head, %ecx ; load head into ecx
movl %ecx, 4(%eax) ; new_node->next = head
movl %eax, head ; head = new_node
addl $8, %esp ; clean up stack from malloc argument
popl %esi ; restore esi
retl ; return
Do I need to study this assembly code closely?
No. It's just a reminder that ultimately all your code becomes instructions like these. We're not going to ask you to write assembly code like this, but it's good to have a broad sense of what's going on under the hood.
Again, we can see that there is no single instruction that adds a linked-list element—the process requires several instructions, and, in particular, we need to both move the old value of head to the next pointer of the new node and then update head to point to the new node, and we want to do that atomically.
Meh. So basically atomic instruction aren't up to the job. Got it.
Actually, it turns out we just need a better atomic instruction.
Compare and Swap to the Rescue
It can be argued that the problem is with the read-modify-write concept we've been working with, as this approach always unconditionally updates the value.
Let's rethink the problem. What if we could read the value of head, check that it's still the same as when we read it, and then update it only if it's still the same? This is the idea behind the compare-and-swap instruction.
Here's pseudocode for the compare_and_swap instruction, based on the compiler's intrinsic function __atomic_compare_exchange_n:
/* compare_and_swap: atomically update *ptr if it equals expected
* (as C code, this isn't atomic, but as a special CPU instruction it is)
*/
bool
compare_and_swap(int *ptr, int expected, int new) {
int old = *ptr;
if (old == expected) {
*ptr = new;
return true;
}
return false;
}
Solving x = 2*(x+1) with Compare and Swap
Let's see how we can use compare_and_swap to solve the x = 2*(x+1) problem atomically:
void atomic_increment_and_double(int *x) {
int expected, new;
do {
expected = *x;
new = 2 * (expected + 1);
} while (!compare_and_swap(x, expected, new));
}
I think I see why it works. The
compare_and_swaponly updatesxif*xis still equal toexpected. If another thread has modifiedxin the meantime, thecompare_and_swapwill fail, and the loop will try again.
That's right. Do you see any potential concerns with this code?
Well, it could keep failing right? Going 'round and 'round in circles chasing its tail?
In principle, if there are lots of threads contending to update
x,compare_and_swapcould keep failing many times in a row. But in practice, it's unlikely to be a problem.
The more time we have between reading the initial value (
expected) and writing the new value (new), the more likelycompare_and_swapis to loop. In this case, there's hardly any gap between the read and write, so it's not a big concern.
Solving Linked-List Insertion with Compare and Swap
Let's see how we can use compare_and_swap to solve the linked-list–insertion problem atomically.
void atomic_add_to_list(int value) {
struct list_node *new_node = malloc(sizeof(struct list_node));
do {
new_node->data = value;
new_node->next = head;
} while (!compare_and_swap(&head, new_node->next, new_node));
}
This is so cool. No locks to worry about!
It is neat, isn't it? Let's take a look at an example.
Lock-Free Producer–Consumer Problem
Here's a version of the producer–consumer problem that uses compare_and_swap instead of locks. (The code also uses the C11 header stdatomic.h for atomic operations—you should be able to get a sense of what the functions mean from their names.)
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>
#include <sched.h>
#define NUM_PRODUCERS 2
#define NUM_CONSUMERS 2
#define ITEMS_PER_PRODUCER 10
// Sentinel value
#define NO_DATA -1 // Stack is empty (transitory)
struct stack_node {
int data;
struct stack_node *next;
};
// Global stack head
static _Atomic(struct stack_node *) stack_head = NULL;
// Atomic counter for active producers
static atomic_int num_producers = 0;
// Helper function to push an item onto the stack
void
push_item(int value)
{
struct stack_node *new_node = malloc(sizeof(struct stack_node));
new_node->data = value;
struct stack_node *old_head;
do {
old_head = atomic_load(&stack_head);
new_node->next = old_head;
} while (
!atomic_compare_exchange_weak(&stack_head, &old_head, new_node));
}
// Helper function to pop an item from the stack
int
pop_item(void)
{
while (1) {
struct stack_node *popped_node = atomic_load(&stack_head);
if (popped_node == NULL) {
return NO_DATA; // Stack is empty
}
struct stack_node *new_head = popped_node->next;
if (atomic_compare_exchange_weak(&stack_head, &popped_node,
new_head)) {
int value = popped_node->data;
free(popped_node);
return value;
}
// If CAS failed, another thread modified the stack, so we try
// again
}
}
void *
producer(void *arg)
{
int id = *(int *) arg;
for (int i = 0; i < ITEMS_PER_PRODUCER; i++) {
int value = id * 100 + i; // Just to make unique items
push_item(value);
printf("Producer %d pushed: %d\n", id, value);
usleep(rand() % 100000); // Sleep up to 0.1 seconds
}
atomic_fetch_sub(&num_producers, 1);
printf("Producer %d finished\n", id);
return NULL;
}
void *
consumer(void *arg)
{
int id = *(int *) arg;
while (1) {
int value = pop_item();
if (value == NO_DATA) {
if (atomic_load(&num_producers) == 0) {
printf(
"Consumer %d exiting: no more producers\n",
id);
break;
}
sched_yield(); // Stack is empty, yield and try again
continue;
}
printf("Consumer %d popped: %d\n", id, value);
usleep(rand() % 200000); // Sleep up to 0.2 seconds
}
return NULL;
}
int
main()
{
pthread_t producers[NUM_PRODUCERS];
pthread_t consumers[NUM_CONSUMERS];
int producer_ids[NUM_PRODUCERS];
int consumer_ids[NUM_CONSUMERS];
// Set initial number of producers
atomic_store(&num_producers, NUM_PRODUCERS);
// Create producer threads
for (int i = 0; i < NUM_PRODUCERS; i++) {
producer_ids[i] = i;
pthread_create(&producers[i], NULL, producer, &producer_ids[i]);
}
// Create consumer threads
for (int i = 0; i < NUM_CONSUMERS; i++) {
consumer_ids[i] = i;
pthread_create(&consumers[i], NULL, consumer, &consumer_ids[i]);
}
// Wait for all threads to complete
for (int i = 0; i < NUM_PRODUCERS; i++) {
pthread_join(producers[i], NULL);
}
for (int i = 0; i < NUM_CONSUMERS; i++) {
pthread_join(consumers[i], NULL);
}
return 0;
}
You can also contrast it with the locks-and-condition-variables version from Week 3, Lesson 1.
Hay! You totally cheated! The original version was the bounded buffer problem (with
BUFFER_SIZEelements), and now you've made it unbounded.
And the consumers spin around chasing their tails until they get some data; they used to call
pthread_cond_waitand take a nap.
I noticed that the lock-based version uses a queue for the data, while the lock-free version uses a stack. Why is that?
These are all good observations. We made some changes to the problem to make it easier to implement it as lock-free. Let's see why…
Limitations of Compare and Swap
Compare and swap only works on a single memory location (technically, at most a “machine word” in size; 64-bits on a 64-bit system and 32-bits on a 32-bit system). It's a powerful tool, but it can't turn every synchronization problem into lock-free atomic goodness.
In particular, lock-free queues require more work than just updating a single head pointer, and so are challenging to implement in the general case. There are general-purpose lock-free queues, but they're complex and beyond the scope of this course.
Hay! Wait a minute—I have an idea! Don't we say that “Worse is Better” in CS 134?
Sometimes, in the sense that it's better to have a simple solution that works well enough than a complex solution that's perfect.
Well, you were talking about general-purpose lock-free queues in the general case, which sounds like they've got all the bells and whistles. But is there a “good enough” solution that's simpler and we could implement?
You know you want to….
Okay, okay, you've got me. We'll look at a simple lock-free queue…
A “Good Enough” Lock-Free Queue
If we look back at our old lock-based code with a bounded buffer, we see it has this struct:
struct circular_buffer {
struct message buffer[BUFFER_SIZE];
int head;
int tail;
int count;
pthread_mutex_t mutex;
pthread_cond_t not_full;
pthread_cond_t not_empty;
};
We obviously don't need the mutex and condition variables in a lock-free version. But it's also the case that we never really needed tail, either, as the value of tail can be computed from head and count (specifically, tail = (head + count) % BUFFER_SIZE). So we can simplify our struct to
struct circular_buffer {
struct message buffer[BUFFER_SIZE];
int head;
int count;
};
Now we're only updating two 32-bit ints that are right next to each other, so we can instead see them as a single 64-bit value. We can use compare_and_swap to update this 64-bit value atomically.
Since this is a bit of a rabbit hole, we won't put the code on the page but you can try it out:
Hay! I looked at the code and it's more complicated. You also used compare-and-swap to put things into
buffer.
That's true. In fact, a fair bit of head scratching was involved in trying to be sure it was right, even though this is the purportedly “simple” solution.
I also noticed that sometimes you used
atomic_compare_exchange_strongand sometimesatomic_compare_exchange_weak. What's the difference?
Uh…
The weak version is allowed to spuriously fail, but can be more efficient. The rule of thumb is to use weak when you're already in a loop and a spurious failure is cheap, and strong when you're not in a loop, or when retrying the entire operation is expensive.
But now we've gone way too far down the rabbit hole. Let's come back to some take-aways…
But wait, I made an even better version that's less likely to be livelock-ish because it makes better guarantees of progress. Wanna check my work and make sure I didn't make a mistake somewhere?
I know you're super into this stuff, but please stop, my head is spinning.
Probably Rabbit's code is fine, but you don't need to look at it unless you really want to. I think it might be an improvement, but this kind of code is hard to get right.
Takeaways
- Compare-and-swap is a powerful atomic operation that can be used to implement lock-free data structures.
- A classic pattern in compare-and-swap code is to read a value, compute a new value based on the old one, then try to update the value with
compare_and_swap. If the update fails, try again. - The compare-and-swap operation only works on a machine-word–sized value, so it can't be used to atomically update multiple values (unless those values can all be squeezed into a single machine word).
- Lock-free data structures can be more efficient than lock-based ones, since they avoid the overhead of locks.
- Lock-free data structures are not a panacea—they have limitations and can be challenging to implement in the general case, and they aren't always easier to reason about than lock-based ones.
(When logged in, completion status appears here.)