CS 134

Compare and Swap

Suppose that we wanted to compute x = 2*(x+1); atomically. We could try writing it on an x86 machine as

    lock incl   x       ; x += 1    (atomic increment)
    lock sall   x       ; x <<= 1   (atomic left shift by one => mult by 2)

What's the problem with this code?

Similarly, let's consider this code for adding a list node onto the front of a linked list:

// declare malloc because reasons
void *malloc(size_t size);

struct list_node {
        int data;
        struct list_node *next;
};

struct list_node *head;

void add_to_list(int value) {
        struct list_node *new_node = malloc(sizeof(struct list_node));
        new_node->data = value;
        new_node->next = head;
        head = new_node;
}

The assembly code for this function might look something like

add_to_list:
    pushl   %esi                ; save esi because we're gonna use it
    subl    $8, %esp            ; argument for malloc goes on the stack
    movl    $8, (%esp)          ; size of struct list_node
    calll   malloc              ; allocate memory for new node, address in eax
    movl    16(%esp), %esi      ; load value into esi
    movl    %esi, (%eax)        ; new_node->data = value
    movl    head, %ecx          ; load head into ecx
    movl    %ecx, 4(%eax)       ; new_node->next = head
    movl    %eax, head          ; head = new_node
    addl    $8, %esp            ; clean up stack from malloc argument
    popl    %esi                ; restore esi
    retl                        ; return
  • Hedgehog speaking

    Do I need to study this assembly code closely?

  • PinkRobot speaking

    No. It's just a reminder that ultimately all your code becomes instructions like these. We're not going to ask you to write assembly code like this, but it's good to have a broad sense of what's going on under the hood.

Again, we can see that there is no single instruction that adds a linked-list element—the process requires several instructions, and, in particular, we need to both move the old value of head to the next pointer of the new node and then update head to point to the new node, and we want to do that atomically.

  • Goat speaking

    Meh. So basically atomic instruction aren't up to the job. Got it.

  • PinkRobot speaking

    Actually, it turns out we just need a better atomic instruction.

Compare and Swap to the Rescue

It can be argued that the problem is with the read-modify-write concept we've been working with, as this approach always unconditionally updates the value.

Let's rethink the problem. What if we could read the value of head, check that it's still the same as when we read it, and then update it only if it's still the same? This is the idea behind the compare-and-swap instruction.

Here's pseudocode for the compare_and_swap instruction, based on the compiler's intrinsic function __atomic_compare_exchange_n:

/* compare_and_swap: atomically update *ptr if it equals expected
 * (as C code, this isn't atomic, but as a special CPU instruction it is)
 */
bool
compare_and_swap(int *ptr, int expected, int new) {
        int old = *ptr;
        if (old == expected) {
                *ptr = new;
                return true;
        }
        return false;
}

Solving x = 2*(x+1) with Compare and Swap

Let's see how we can use compare_and_swap to solve the x = 2*(x+1) problem atomically:

void atomic_increment_and_double(int *x) {
        int expected, new;
        do {
                expected = *x;
                new = 2 * (expected + 1);
        } while (!compare_and_swap(x, expected, new));
}

Explain why this code works and is effectively atomic. Also, do you have any concerns about the code?

  • Cat speaking

    I think I see why it works. The compare_and_swap only updates x if *x is still equal to expected. If another thread has modified x in the meantime, the compare_and_swap will fail, and the loop will try again.

  • PinkRobot speaking

    That's right. Do you see any potential concerns with this code?

  • Dog speaking

    Well, it could keep failing right? Going 'round and 'round in circles chasing its tail?

  • PinkRobot speaking

    In principle, if there are lots of threads contending to update x, compare_and_swap could keep failing many times in a row. But in practice, it's unlikely to be a problem.

  • BlueRobot speaking

    The more time we have between reading the initial value (expected) and writing the new value (new), the more likely compare_and_swap is to loop. In this case, there's hardly any gap between the read and write, so it's not a big concern.

Solving Linked-List Insertion with Compare and Swap

Let's see how we can use compare_and_swap to solve the linked-list–insertion problem atomically.

void atomic_add_to_list(int value) {
        struct list_node *new_node = malloc(sizeof(struct list_node));
        do {
                new_node->data = value;
                new_node->next = head;
        } while (!compare_and_swap(&head, new_node->next, new_node));
}
  • Duck speaking

    This is so cool. No locks to worry about!

  • PinkRobot speaking

    It is neat, isn't it? Let's take a look at an example.

Lock-Free Producer–Consumer Problem

Here's a version of the producer–consumer problem that uses compare_and_swap instead of locks. (The code also uses the C11 header stdatomic.h for atomic operations—you should be able to get a sense of what the functions mean from their names.)

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>
#include <sched.h>

#define NUM_PRODUCERS 2
#define NUM_CONSUMERS 2
#define ITEMS_PER_PRODUCER 10

// Sentinel value
#define NO_DATA -1 // Stack is empty (transitory)

struct stack_node {
        int data;
        struct stack_node *next;
};

// Global stack head
static _Atomic(struct stack_node *) stack_head = NULL;

// Atomic counter for active producers
static atomic_int num_producers = 0;

// Helper function to push an item onto the stack
void
push_item(int value)
{
        struct stack_node *new_node = malloc(sizeof(struct stack_node));
        new_node->data = value;

        struct stack_node *old_head;
        do {
                old_head = atomic_load(&stack_head);
                new_node->next = old_head;
        } while (
            !atomic_compare_exchange_weak(&stack_head, &old_head, new_node));
}

// Helper function to pop an item from the stack
int
pop_item(void)
{
        while (1) {
                struct stack_node *popped_node = atomic_load(&stack_head);
                if (popped_node == NULL) {
                        return NO_DATA; // Stack is empty
                }

                struct stack_node *new_head = popped_node->next;
                if (atomic_compare_exchange_weak(&stack_head, &popped_node,
                                                 new_head)) {
                        int value = popped_node->data;
                        free(popped_node);
                        return value;
                }
                // If CAS failed, another thread modified the stack, so we try
                // again
        }
}

void *
producer(void *arg)
{
        int id = *(int *) arg;
        for (int i = 0; i < ITEMS_PER_PRODUCER; i++) {
                int value = id * 100 + i; // Just to make unique items
                push_item(value);
                printf("Producer %d pushed: %d\n", id, value);
                usleep(rand() % 100000); // Sleep up to 0.1 seconds
        }

        atomic_fetch_sub(&num_producers, 1);
        printf("Producer %d finished\n", id);
        return NULL;
}

void *
consumer(void *arg)
{
        int id = *(int *) arg;
        while (1) {
                int value = pop_item();

                if (value == NO_DATA) {
                        if (atomic_load(&num_producers) == 0) {
                                printf(
                                    "Consumer %d exiting: no more producers\n",
                                    id);
                                break;
                        }
                        sched_yield(); // Stack is empty, yield and try again
                        continue;
                }

                printf("Consumer %d popped: %d\n", id, value);
                usleep(rand() % 200000); // Sleep up to 0.2 seconds
        }
        return NULL;
}

int
main()
{
        pthread_t producers[NUM_PRODUCERS];
        pthread_t consumers[NUM_CONSUMERS];
        int producer_ids[NUM_PRODUCERS];
        int consumer_ids[NUM_CONSUMERS];

        // Set initial number of producers
        atomic_store(&num_producers, NUM_PRODUCERS);

        // Create producer threads
        for (int i = 0; i < NUM_PRODUCERS; i++) {
                producer_ids[i] = i;
                pthread_create(&producers[i], NULL, producer, &producer_ids[i]);
        }

        // Create consumer threads
        for (int i = 0; i < NUM_CONSUMERS; i++) {
                consumer_ids[i] = i;
                pthread_create(&consumers[i], NULL, consumer, &consumer_ids[i]);
        }

        // Wait for all threads to complete
        for (int i = 0; i < NUM_PRODUCERS; i++) {
                pthread_join(producers[i], NULL);
        }
        for (int i = 0; i < NUM_CONSUMERS; i++) {
                pthread_join(consumers[i], NULL);
        }

        return 0;
}

You can also contrast it with the locks-and-condition-variables version from Week 3, Lesson 1.

Contrast the lock-based and lock-free versions of the producer–consumer problem. What do you think? Do you have any concerns with the code we've shown?

  • Horse speaking

    Hay! You totally cheated! The original version was the bounded buffer problem (with BUFFER_SIZE elements), and now you've made it unbounded.

  • Dog speaking

    And the consumers spin around chasing their tails until they get some data; they used to call pthread_cond_wait and take a nap.

  • Cat speaking

    I noticed that the lock-based version uses a queue for the data, while the lock-free version uses a stack. Why is that?

  • PinkRobot speaking

    These are all good observations. We made some changes to the problem to make it easier to implement it as lock-free. Let's see why…

Limitations of Compare and Swap

Compare and swap only works on a single memory location (technically, at most a “machine word” in size; 64-bits on a 64-bit system and 32-bits on a 32-bit system). It's a powerful tool, but it can't turn every synchronization problem into lock-free atomic goodness.

In particular, lock-free queues require more work than just updating a single head pointer, and so are challenging to implement in the general case. There are general-purpose lock-free queues, but they're complex and beyond the scope of this course.

  • Horse speaking

    Hay! Wait a minute—I have an idea! Don't we say that “Worse is Better” in CS 134?

  • PinkRobot speaking

    Sometimes, in the sense that it's better to have a simple solution that works well enough than a complex solution that's perfect.

  • Horse speaking

    Well, you were talking about general-purpose lock-free queues in the general case, which sounds like they've got all the bells and whistles. But is there a “good enough” solution that's simpler and we could implement?

  • Rabbit speaking

    You know you want to….

  • PinkRobot speaking

    Okay, okay, you've got me. We'll look at a simple lock-free queue…

A “Good Enough” Lock-Free Queue

If we look back at our old lock-based code with a bounded buffer, we see it has this struct:

struct circular_buffer {
        struct message buffer[BUFFER_SIZE];
        int head;
        int tail;
        int count;
        pthread_mutex_t mutex;
        pthread_cond_t not_full;
        pthread_cond_t not_empty;
};

We obviously don't need the mutex and condition variables in a lock-free version. But it's also the case that we never really needed tail, either, as the value of tail can be computed from head and count (specifically, tail = (head + count) % BUFFER_SIZE). So we can simplify our struct to

struct circular_buffer {
        struct message buffer[BUFFER_SIZE];
        int head;
        int count;
};

Now we're only updating two 32-bit ints that are right next to each other, so we can instead see them as a single 64-bit value. We can use compare_and_swap to update this 64-bit value atomically.

Since this is a bit of a rabbit hole, we won't put the code on the page but you can try it out:

  • Horse speaking

    Hay! I looked at the code and it's more complicated. You also used compare-and-swap to put things into buffer.

  • PinkRobot speaking

    That's true. In fact, a fair bit of head scratching was involved in trying to be sure it was right, even though this is the purportedly “simple” solution.

  • Cat speaking

    I also noticed that sometimes you used atomic_compare_exchange_strong and sometimes atomic_compare_exchange_weak. What's the difference?

  • PinkRobot speaking

    Uh…

  • Rabbit speaking

    The weak version is allowed to spuriously fail, but can be more efficient. The rule of thumb is to use weak when you're already in a loop and a spurious failure is cheap, and strong when you're not in a loop, or when retrying the entire operation is expensive.

  • PinkRobot speaking

    But now we've gone way too far down the rabbit hole. Let's come back to some take-aways…

  • Rabbit speaking

    But wait, I made an even better version that's less likely to be livelock-ish because it makes better guarantees of progress. Wanna check my work and make sure I didn't make a mistake somewhere?

  • Hedgehog speaking

    I know you're super into this stuff, but please stop, my head is spinning.

  • PinkRobot speaking

    Probably Rabbit's code is fine, but you don't need to look at it unless you really want to. I think it might be an improvement, but this kind of code is hard to get right.

Takeaways

  • Compare-and-swap is a powerful atomic operation that can be used to implement lock-free data structures.
  • A classic pattern in compare-and-swap code is to read a value, compute a new value based on the old one, then try to update the value with compare_and_swap. If the update fails, try again.
  • The compare-and-swap operation only works on a machine-word–sized value, so it can't be used to atomically update multiple values (unless those values can all be squeezed into a single machine word).
  • Lock-free data structures can be more efficient than lock-based ones, since they avoid the overhead of locks.
  • Lock-free data structures are not a panacea—they have limitations and can be challenging to implement in the general case, and they aren't always easier to reason about than lock-based ones.

(When logged in, completion status appears here.)