CS 134

Test and Set

Back in Week 2, Lesson 2, we introduced spinlocks as a simple synchronization primitive. We described them as follows:

A spinlock is a simple lock that spins in a loop until it can acquire the lock.

But how does a spinlock actually work? Let's try to write it. First, we'll have a simple structure to represent the lock (slightly simpler than what we have in OS/161, since we're omitting the name field and some other details):

struct spinlock {
    int locked;
};

And now let's try to write the spinlock_acquire function:

void
spinlock_acquire(struct spinlock *lock)
{
        while (lock->locked) {
                // spin
        }
        // We now know lock->locked is 0, so we can acquire the lock
        lock->locked = 1;
}

Alas, there are multiple problems here…

Deadly Compiler Optimizations

The compiler might optimize the while loop to something like

void spinlock_acquire(struct spinlock *lock) {
        if (lock->locked) {
                while (1) {
                        // spin
                }
        }
        // We now know lock->locked is 0, so we can acquire the lock
        lock->locked = 1;
}

It thinks it can do that because we haven't called any external functions, so the compiler has no reason to suppose that lock->locked will ever change. So it will think, “Why bother checking it again? This is just an infinite loop!”

We can fix this issue by declaring lock as volatile:

struct spinlock {
        volatile int locked;
};

The volatile keyword tells the compiler that the value of lock->locked can change at any time, so it should not optimize the loop in this way.

Preemption

Suppose a timer interrupt occurs right at the point where we have the comment

// We now know lock->locked is 0, so we can acquire the lock

and the OS switches to another thread (a.k.a., we are preempted). The other thread runs and acquires the lock. When the OS returns to our initial thread and picks back up at the comment, we will think we are able to acquire the lock even though another thread might have acquired it in the meantime.

For now, for simplicity, let's assume for now that we're on a uniprocessor (one core) system.

We could fix this issue by disabling interrupts before checking the lock and reenabling them after acquiring the lock. Here's an attempt:

void
spinlock_acquire(struct spinlock *lock)
{
        int spl = splhigh();        // Disable interrupts
        while (lock->locked) {
                // spin
        }
        lock->locked = 1;
        splx(spl);                  // Restore interrupt level
}

Unfortunately, this code is really badly broken. Can you see why?

  • Dog speaking

    It chases its tail forever! While we're spinning, we've left interrupts disabled, so no one else can run, and no one can release the lock.

  • Duck speaking

    I think there is a hint in the name of this page—we need a test-and-set helper function!

  • PinkRobot speaking

    Good idea. Let's see how that might work.

Let's revise the code to add a helper function, test_and_set, that atomically sets a value and returns the old value. We can use this function to implement the spinlock.

/*
 * This function always sets the memory at *value to 1, but, importantly
 * - It returns the old value of *value; and,
 * - It reads and writes *value atomically by disabling interrupts.
 */
int
test_and_set(volatile int *value) {
        int spl = splhigh();        // Disable interrupts
        int old = *value;
        *value = 1;
        splx(spl);                  // Restore interrupt level
        return old;
}

void
spinlock_acquire(struct spinlock *lock)
{
        while (test_and_set(&lock->locked)) {
                // spin
        }
}
  • Horse speaking

    Hay! I'm not sure what the while loop is doing here. Can you explain it to me?

  • PinkRobot speaking

    Certainly. Remember that test_and_set always returns the old value of the variable before we set it to 1, so if it was already set to 1 (i.e., something is already holding the lock), test_and_set returns true, and we keep spinning. If the lock variable's value was 0 (i.e., nothing is holding the lock), test_and_set returns false, and we break out of the spin loop, but we have also set the lock to 1.

  • Dog speaking

    And we can't be interrupted during the test and set, but we can be interrupted while we spin. No more infinite tail chasing!

There's just one problem…

  • Hedgehog speaking

    Oh, no.

What's the problem?

  • Duck speaking

    It won't work on a multiprocessor system! When we disable interrupts, we're only disabling them on the current processor.

  • PinkRobot speaking

    That's right. Our atomicity is only guaranteed on the current processor.

There are ways to fix the issue by writing code, and in the early days of computer science, that's what people did. For example, in 1974 Leslie Lamport wrote A New Solution of Dijkstra's Concurrent Programming Problem that introduced “The Bakery Algorithm” for mutual exclusion, which works a bit like taking a number at a bakery. But these solutions are complex, not especially efficient, and error-prone to implement.

There is a better way, but it meant getting the people who make the hardware (CPUs) to help.

  • Rabbit speaking

    Lamport's algorithm still does see some use in some specialized cases; particularly in distributed systems. But for general-purpose systems, we have better tools now.

Atomic Instructions

Faced with the challenge of writing functions like test_and_set that work on multiprocessor systems, computer scientists asked the processor designers for help in creating special instructions that would be atomic even on multiprocessor systems. These are called atomic instructions.

One of the earliest of these was the test-and-set instruction, which is often abbreviated to tas. It's a single instruction that reads a value from memory, sets it to 1, and returns the old value, just like our test_and_set function. The key difference is that because tas is a single CPU instruction rather than a function, it's guaranteed to be atomic, even on multiprocessor systems.

Here's the description of tas from the manual for the Motorola 68000 series of processors (from 1979):

Description: Test and set the byte operand addressed by the effective address field. The N- and Z-bits of the CCR are updated accordingly. The high-order bit of the operand (i.e., bit 7) is set. This operation is indivisible and uses a read-modify-write cycle.

Application: Its principal application is in multiprocessor systems. The TAS instruction permits one processor in a multiprocessor system to test a resource (e.g., shared memory) and claim the resource if it is free. The most-significant bit of the byte at the effective address is used as a semaphore to indicate whether the shared resource is free. The TAS instruction reads the semaphore bit to find the state of the resource, and then sets the semaphore to claim the resource (if it was free). Because the operation is indivisible, no other processor can access the memory between the testing of the bit and its subsequent setting.

  • Cat speaking

    I don't think they mean “semaphore” there.

  • PinkRobot speaking

    No, these days we'd call it a lock, but the idea is the same.

  • Rabbit speaking

    Fun fact, IBM's System/360 had a TS instruction that was similar, and that was back in 1964!

What's notable here:

  • The application is specifically to support synchronization in multiprocessor systems.
  • The hardware designers had to provide an “indivisible” operation, which means that the operation is guaranteed to complete without interruption, via a special “read-modify-write” cycle at the hardware level.

The Intel Answer: xchg

Over at Intel, they decided that if the hardware was going to provide a special “read-modify-write” instruction, they could do one better and provide a general-purpose instruction that could atomically swap two values: the xchg instruction. It's a bit more general than tas, but it can be used to implement tas and a whole lot more.

A uniprocessor hand-written interrupt-disabling version of xchg might look like

int
xchg(volatile int *value, int newval) {
        int spl = splhigh();        // Disable interrupts
        int old = *value;
        *value = newval;
        splx(spl);                  // Restore interrupt level
        return old;
}

The built-into-the-CPU version is obviously better, since it uses a single CPU instruction and is guaranteed to be atomic across all processors.

We can implement test_and_set using xchg as

int
test_and_set(volatile int *value) {
        return xchg(value, 1);
}

The ARM architecture used in current Macs has a similar instruction called swp (swap).

A Deeper Dive: The lock Prefix

In fact, with the advent of the 80386 CPU in 1985, Intel introduced a lock prefix that could be used with certain instructions to make them atomic. For example, the xchg instruction was always atomic, but other instructions could be made atomic with this prefix, generating a locked “read-modify-write” cycle. It worked by locking the memory bus so that no other processor could access memory while the instruction was being executed.

So, for example, we could increment a counter atomically in x86 assembly like

        lock addl $1, counter

This change gave us a pretty powerful feature, but it was also a bit of a sledgehammer. It locked the memory bus for the entire instruction, which could be a bit of a performance hit. While still supported, it's not used as much these days, as we'll see in the next lesson.

Using Test-and-Set in Your Code

Although test-and-set is a processor instruction, many C compilers make this instruction available to you in some form. For example, in GCC, Clang, and the Intel C Compiler, you can use the built-in magic function __atomic_test_and_set:

void spinlock_acquire(struct spinlock *lock) {
    while (__atomic_test_and_set(&lock->locked, __ATOMIC_ACQUIRE)) {
        // spin
    }
}
  • Horse speaking

    Hay! What's that __ATOMIC_ACQUIRE thing?

  • PinkRobot speaking

    It refers to the memory ordering model. We may talk about that in a future lesson.

  • L-Chippy speaking

    Modern hardware is pretty weird, because of caches and such. Various tricks that compilers and processors do to run serial code faster can cause problems in concurrent code. The memory ordering model is a way to tell the compiler and processor how to handle memory accesses in a way that's safe for concurrent code.

  • PinkRobot speaking

    For now, though, don't worry about it too much. Just know it's a thing. It's also why we're not worrying about using atomic instructions in our OS/161 code.

Here's the problem we solved using reader–writer locks last lesson using the __atomic_... functions provided by GCC:

And as a reminder, here's the code from last lesson using reader–writer locks:

When you contrast the atomic-instructions version of the code against the reader–writer locks version, what do you notice.

(When logged in, completion status appears here.)