Compare and Swap

Suppose that we wanted to compute x = 2*(x+1); atomically. We could try writing it on an x86 machine as

    lock incl   x       ; x += 1    (atomic increment)
    lock sall   x       ; x <<= 1   (atomic left shift by one => mult by 2)

Similarly, let's consider this code for adding a list node onto the front of a linked list:

// declare malloc because reasons
void *malloc(size_t size);

struct list_node {
        int data;
        struct list_node *next;
};

struct list_node *head;

void add_to_list(int value) {
        struct list_node *new_node = malloc(sizeof(struct list_node));
        new_node->data = value;
        new_node->next = head;
        head = new_node;
}

The assembly code for this function might look something like

add_to_list:
    pushl   %esi                ; save esi because we're gonna use it
    subl    $8, %esp            ; argument for malloc goes on the stack
    movl    $8, (%esp)          ; size of struct list_node
    calll   malloc              ; allocate memory for new node, address in eax
    movl    16(%esp), %esi      ; load value into esi
    movl    %esi, (%eax)        ; new_node->data = value
    movl    head, %ecx          ; load head into ecx
    movl    %ecx, 4(%eax)       ; new_node->next = head
    movl    %eax, head          ; head = new_node
    addl    $8, %esp            ; clean up stack from malloc argument
    popl    %esi                ; restore esi
    retl                        ; return

Do I need to study this assembly code closely?
No. It's just a reminder that ultimately all your code becomes instructions like these. We're not going to ask you to write assembly code like this, but it's good to have a broad sense of what's going on under the hood.

Again, we can see that there is no single instruction that adds a linked-list element—the process requires several instructions, and, in particular, we need to both move the old value of head to the next pointer of the new node and then update head to point to the new node, and we want to do that atomically.

Meh. So basically atomic instruction aren't up to the job. Got it.
Actually, it turns out we just need a better atomic instruction.

Compare and Swap to the Rescue

It can be argued that the problem is with the read-modify-write concept we've been working with, as this approach always unconditionally updates the value.

Let's rethink the problem. What if we could read the value of head, check that it's still the same as when we read it, and then update it only if it's still the same? This is the idea behind the compare-and-swap instruction.

Here's pseudocode for the compare_and_swap instruction, based on the compiler's intrinsic function __atomic_compare_exchange_n:

/* compare_and_swap: atomically update *ptr if it equals expected
 * (as C code, this isn't atomic, but as a special CPU instruction it is)
 */
bool
compare_and_swap(int *ptr, int expected, int new) {
        int old = *ptr;
        if (old == expected) {
                *ptr = new;
                return true;
        }
        return false;
}

Solving `x = 2*(x+1)` with Compare and Swap

Let's see how we can use compare_and_swap to solve the x = 2*(x+1) problem atomically:

void atomic_increment_and_double(int *x) {
        int expected, new;
        do {
                expected = *x;
                new = 2 * (expected + 1);
        } while (!compare_and_swap(x, expected, new));
}

I think I see why it works. The compare_and_swap only updates x if *x is still equal to expected. If another thread has modified x in the meantime, the compare_and_swap will fail, and the loop will try again.
That's right. Do you see any potential concerns with this code?
Well, it could keep failing right? Going 'round and 'round in circles chasing its tail?
In principle, if there are lots of threads contending to update x, compare_and_swap could keep failing many times in a row. But in practice, it's unlikely to be a problem.
The more time we have between reading the initial value (expected) and writing the new value (new), the more likely compare_and_swap is to loop. In this case, there's hardly any gap between the read and write, so it's not a big concern.

Solving Linked-List Insertion with Compare and Swap

Let's see how we can use compare_and_swap to solve the linked-list–insertion problem atomically.

void atomic_add_to_list(int value) {
        struct list_node *new_node = malloc(sizeof(struct list_node));
        do {
                new_node->data = value;
                new_node->next = head;
        } while (!compare_and_swap(&head, new_node->next, new_node));
}

This is so cool. No locks to worry about!
It is neat, isn't it? Let's take a look at an example.

Lock-Free Producer–Consumer Problem

Here's a version of the producer–consumer problem that uses compare_and_swap instead of locks. (The code also uses the C11 header stdatomic.h for atomic operations—you should be able to get a sense of what the functions mean from their names.)

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>
#include <sched.h>

#define NUM_PRODUCERS 2
#define NUM_CONSUMERS 2
#define ITEMS_PER_PRODUCER 10

// Sentinel value
#define NO_DATA -1 // Stack is empty (transitory)

struct stack_node {
        int data;
        struct stack_node *next;
};

// Global stack head
static _Atomic(struct stack_node *) stack_head = NULL;

// Atomic counter for active producers
static atomic_int num_producers = 0;

// Helper function to push an item onto the stack
void
push_item(int value)
{
        struct stack_node *new_node = malloc(sizeof(struct stack_node));
        new_node->data = value;

        struct stack_node *old_head;
        do {
                old_head = atomic_load(&stack_head);
                new_node->next = old_head;
        } while (
            !atomic_compare_exchange_weak(&stack_head, &old_head, new_node));
}

// Helper function to pop an item from the stack
int
pop_item(void)
{
        while (1) {
                struct stack_node *popped_node = atomic_load(&stack_head);
                if (popped_node == NULL) {
                        return NO_DATA; // Stack is empty
                }

                struct stack_node *new_head = popped_node->next;
                if (atomic_compare_exchange_weak(&stack_head, &popped_node,
                                                 new_head)) {
                        int value = popped_node->data;
                        free(popped_node);
                        return value;
                }
                // If CAS failed, another thread modified the stack, so we try
                // again
        }
}

void *
producer(void *arg)
{
        int id = *(int *) arg;
        for (int i = 0; i < ITEMS_PER_PRODUCER; i++) {
                int value = id * 100 + i; // Just to make unique items
                push_item(value);
                printf("Producer %d pushed: %d\n", id, value);
                usleep(rand() % 100000); // Sleep up to 0.1 seconds
        }

        atomic_fetch_sub(&num_producers, 1);
        printf("Producer %d finished\n", id);
        return NULL;
}

void *
consumer(void *arg)
{
        int id = *(int *) arg;
        while (1) {
                int value = pop_item();

                if (value == NO_DATA) {
                        if (atomic_load(&num_producers) == 0) {
                                printf(
                                    "Consumer %d exiting: no more producers\n",
                                    id);
                                break;
                        }
                        sched_yield(); // Stack is empty, yield and try again
                        continue;
                }

                printf("Consumer %d popped: %d\n", id, value);
                usleep(rand() % 200000); // Sleep up to 0.2 seconds
        }
        return NULL;
}

int
main()
{
        pthread_t producers[NUM_PRODUCERS];
        pthread_t consumers[NUM_CONSUMERS];
        int producer_ids[NUM_PRODUCERS];
        int consumer_ids[NUM_CONSUMERS];

        // Set initial number of producers
        atomic_store(&num_producers, NUM_PRODUCERS);

        // Create producer threads
        for (int i = 0; i < NUM_PRODUCERS; i++) {
                producer_ids[i] = i;
                pthread_create(&producers[i], NULL, producer, &producer_ids[i]);
        }

        // Create consumer threads
        for (int i = 0; i < NUM_CONSUMERS; i++) {
                consumer_ids[i] = i;
                pthread_create(&consumers[i], NULL, consumer, &consumer_ids[i]);
        }

        // Wait for all threads to complete
        for (int i = 0; i < NUM_PRODUCERS; i++) {
                pthread_join(producers[i], NULL);
        }
        for (int i = 0; i < NUM_CONSUMERS; i++) {
                pthread_join(consumers[i], NULL);
        }

        return 0;
}

Try this code in Online GDB

You can also contrast it with the locks-and-condition-variables version from Week 3, Lesson 1.

Check out the old code in Online GDB

Hay! You totally cheated! The original version was the bounded buffer problem (with BUFFER_SIZE elements), and now you've made it unbounded.
And the consumers spin around chasing their tails until they get some data; they used to call pthread_cond_wait and take a nap.
I noticed that the lock-based version uses a queue for the data, while the lock-free version uses a stack. Why is that?
These are all good observations. We made some changes to the problem to make it easier to implement it as lock-free. Let's see why…

Limitations of Compare and Swap

Compare and swap only works on a single memory location (technically, at most a “machine word” in size; 64-bits on a 64-bit system and 32-bits on a 32-bit system). It's a powerful tool, but it can't turn every synchronization problem into lock-free atomic goodness.

In particular, lock-free queues require more work than just updating a single head pointer, and so are challenging to implement in the general case. There are general-purpose lock-free queues, but they're complex and beyond the scope of this course.

Hay! Wait a minute—I have an idea! Don't we say that “Worse is Better” in CS 134?
Sometimes, in the sense that it's better to have a simple solution that works well enough than a complex solution that's perfect.
Well, you were talking about general-purpose lock-free queues in the general case, which sounds like they've got all the bells and whistles. But is there a “good enough” solution that's simpler and we could implement?
You know you want to….
Okay, okay, you've got me. We'll look at a simple lock-free queue…

A “Good Enough” Lock-Free Queue

If we look back at our old lock-based code with a bounded buffer, we see it has this struct:

struct circular_buffer {
        struct message buffer[BUFFER_SIZE];
        int head;
        int tail;
        int count;
        pthread_mutex_t mutex;
        pthread_cond_t not_full;
        pthread_cond_t not_empty;
};

We obviously don't need the mutex and condition variables in a lock-free version. But it's also the case that we never really needed tail, either, as the value of tail can be computed from head and count (specifically, tail = (head + count) % BUFFER_SIZE). So we can simplify our struct to

struct circular_buffer {
        struct message buffer[BUFFER_SIZE];
        int head;
        int count;
};

Now we're only updating two 32-bit ints that are right next to each other, so we can instead see them as a single 64-bit value. We can use compare_and_swap to update this 64-bit value atomically.

Since this is a bit of a rabbit hole, we won't put the code on the page but you can try it out:

Try this code in Online GDB

Hay! I looked at the code and it's more complicated. You also used compare-and-swap to put things into buffer.
That's true. In fact, a fair bit of head scratching was involved in trying to be sure it was right, even though this is the purportedly “simple” solution.
I also noticed that sometimes you used atomic_compare_exchange_strong and sometimes atomic_compare_exchange_weak. What's the difference?
Uh…
The weak version is allowed to spuriously fail, but can be more efficient. The rule of thumb is to use weak when you're already in a loop and a spurious failure is cheap, and strong when you're not in a loop, or when retrying the entire operation is expensive.
But now we've gone way too far down the rabbit hole. Let's come back to some take-aways…
But wait, I made an even better version that's less likely to be livelock-ish because it makes better guarantees of progress. Wanna check my work and make sure I didn't make a mistake somewhere?
I know you're super into this stuff, but please stop, my head is spinning.
Probably Rabbit's code is fine, but you don't need to look at it unless you really want to. I think it might be an improvement, but this kind of code is hard to get right.

Takeaways

Compare-and-swap is a powerful atomic operation that can be used to implement lock-free data structures.
A classic pattern in compare-and-swap code is to read a value, compute a new value based on the old one, then try to update the value with compare_and_swap. If the update fails, try again.
The compare-and-swap operation only works on a machine-word–sized value, so it can't be used to atomically update multiple values (unless those values can all be squeezed into a single machine word).
Lock-free data structures can be more efficient than lock-based ones, since they avoid the overhead of locks.
Lock-free data structures are not a panacea—they have limitations and can be challenging to implement in the general case, and they aren't always easier to reason about than lock-based ones.

(When logged in, completion status appears here.)

Previous Page Next Page

CS 134

Warning: Missing Prerequisites

Compare and Swap

Compare and Swap to the Rescue

Solving `x = 2*(x+1)` with Compare and Swap

Solving Linked-List Insertion with Compare and Swap

Lock-Free Producer–Consumer Problem

Limitations of Compare and Swap

A “Good Enough” Lock-Free Queue

Takeaways

Instructor Controls

Warning: Missing Prerequisites

Compare and Swap

Compare and Swap to the Rescue

Solving x = 2*(x+1) with Compare and Swap

Solving Linked-List Insertion with Compare and Swap

Lock-Free Producer–Consumer Problem

Limitations of Compare and Swap

A “Good Enough” Lock-Free Queue

Takeaways

Instructor Controls

Solving `x = 2*(x+1)` with Compare and Swap