Threads
We've talked about processes, but what about threads?
Good question. Let's begin with a bit of history.
We can think of processes as an abstraction that creates an idealized machine for running a program. In the early days of computing, these idealized machines were inspired by the actual machines of the day, which were uniprocessors (i.e., single core). So it made sense to have a single execution state per process.
Quickly enough, multi-processor machines arrived, but these were initially rare and expensive, and the idea that programmers might want the complexity that arises from multiple execution states per process was not immediately obvious. If programmers wanted multiple tasks to run concurrently, they could always create multiple processes, and various mechanisms were devised to let processes communicate and synchronize with each other.
Special mechanisms were even created to allow processes to share memory, so that they could communicate more easily.
But programming-language designers liked the idea of providing a more lightweight way to run multiple tasks concurrently within the same program (and thus within the same process). That goal led to the development of threads.
Green Threads
In the early days of threads, they were implemented entirely in user space, without any support from the operating system's kernel. User-space threads are sometimes called green threads (or virtual threads).
I want to know MORE about why they called them green threads. Are they plant-based? (Yuck!)
That was the name Java used for its user-space threads, because they were designed by the “Green Team” at Sun Microsystems! The name stuck, and it's now used more generally to refer to user-space threads.
Here's some example code showing a simple implementation of user-space threads from within a program:
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

#define STACK_SIZE 16384 /* Stack size for each thread */

/* Contexts for main and two threads */
ucontext_t contexts[3];       /* 0: main, 1: thread 1, 2: thread 2 */
int current_thread = 1;       /* Tracks the current thread */
volatile int num_threads = 0; /* Number of threads */

void
scheduler(int signum)
{
    (void)signum;
    if (num_threads < 2) {
        printf("Too few threads to schedule\n");
        return;
    }
    int prev_thread = current_thread;
    /* Toggle between thread 1 and thread 2 */
    current_thread = (current_thread == 1) ? 2 : 1;
    /* (printf isn't async-signal-safe; fine for a demo) */
    printf("-- Switching from thread %d to thread %d\n", prev_thread,
        current_thread);
    /* Swap context to the next thread */
    swapcontext(&contexts[prev_thread], &contexts[current_thread]);
}

void
thread1(void)
{
    for (int i = 1; i <= 42; i++) {
        printf("Thread 1: %d\n", i);
        usleep(100000); /* Simulate work */
    }
    printf("Thread 1 exiting\n");
    --num_threads;
}

void
thread2(void)
{
    for (int i = 1; i <= 42; i++) {
        printf("Thread 2: %d\n", i);
        usleep(100000); /* Simulate work */
    }
    printf("Thread 2 exiting\n");
    --num_threads;
}

void
newthread(void (*entrypoint)(void), ucontext_t *contextptr)
{
    void *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    if (getcontext(contextptr) == -1) {
        perror("getcontext");
        exit(EXIT_FAILURE);
    }
    contextptr->uc_stack.ss_sp = stack;
    contextptr->uc_stack.ss_size = STACK_SIZE;
    contextptr->uc_link = &contexts[0]; /* Return to main if thread exits */
    makecontext(contextptr, entrypoint, 0);
    ++num_threads;
}

int
main(void)
{
    struct sigaction sa;

    newthread(thread1, &contexts[1]);
    newthread(thread2, &contexts[2]);

    /* Set up the SIGALRM signal handler */
    sa.sa_handler = scheduler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* Restart interrupted system calls */
    if (sigaction(SIGALRM, &sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    /* Configure the timer to trigger SIGALRM every 0.5 seconds */
    ualarm(500000, 500000);

    /* Start the first thread */
    current_thread = 1;
    swapcontext(&contexts[0], &contexts[1]);

    /* If our scheduler weren't running, we'd run the first thread all
     * the way to completion, but since it is, both threads run
     * concurrently. When one of them exits (we don't know which will
     * exit first), we'll return here. */

    /* Finish the other thread */
    current_thread = (current_thread == 1) ? 2 : 1;
    swapcontext(&contexts[0], &contexts[current_thread]);

    printf("Main thread exiting\n");
    return 0;
}
This code uses
- makecontext, swapcontext, and getcontext (provided by the C library) to manipulate the execution context of the program. Importantly, these functions work just by saving and restoring the CPU registers, so they don't require any special support from the operating-system kernel itself.
- ualarm to set a timer that will interrupt the program and switch to a different thread by calling the scheduler function.
- thread1 and thread2, which each count up to 42 before quitting.
This code is just a simple demonstration. We could write a more full-featured scheduler that can deal with more than two threads, and we could add synchronization primitives to allow threads to communicate and synchronize with each other. In fact, the code would look a lot like what you've already seen and implemented in OS/161.
It feels like there's a bunch of code duplication to me. We've written a scheduler, context-switching code, and so on that mirrors what already exists in the operating system kernel. I already wrote locks and CVs for OS/161; I don't want to write them again!
And there can only ever be one thread running at a time, since the program is interrupted by the timer and switches to a different thread. So it can't take advantage of a multicore machine!
These are good points!
Hay! You forgot to free the memory allocated for the stacks. That's a memory leak!
True, but the program was exiting anyway, and keeping track of the memory would have made the code even more complicated. That issue isn't one we were worried about here, but it's good that you noticed it.
I've got one: What if a thread blocks because it's doing I/O? Either the timer is going to interrupt the I/O, which doesn't seem good, or the thread will be stuck until the I/O operation is complete and thus prevent any other threads from running, which also doesn't seem good.
And those issues are all due to the kernel not knowing that there are multiple threads running. It's just the program that's doing the switching.
For some time, these user-space threads were the only game in town, and even after alternatives arrived, they remained popular because they were lightweight and less prone to race conditions (precisely because nothing actually happens simultaneously).
This approach was used by
- C (via the cthreads library, which later inspired the pthreads library);
- Early versions of Java;
- Python (via the greenlet library);
- Early versions of Go (at least if you didn't adjust the GOMAXPROCS environment variable);
and many others.
Native Threads
Eventually (mostly in the 1990s), operating systems began to provide support for threads in the kernel itself. These threads are called native threads.
But having multiple execution states per process was a significant change, and native threads had to be retrofitted into preexisting operating systems. That retrofit meant revisiting prior design decisions made in those systems.
Things that Need to be Per-Thread
- CPU registers: Each thread needs its own execution state.
- Stack: Each thread needs its own stack (for local variables and function calls).
- Thread state: Each thread needs its own state (e.g., ready, running, waiting).
Things that Need to be Per-Process
- Process ID (PID): A unique identifier for the process as a whole.
- Memory for program code: The program code is shared among all threads in the process.
- Memory for program data: The program data is shared among all threads in the process.
Couldn't we make those things per-thread too, for maximum flexibility?
Well, if we did that, our “threads” would basically just be old-school processes, and we'd lose the benefits of having proper threads. The whole point of threads is that they're inside the same process, so they can share memory and communicate more easily.
That said, sometimes threads do need their own specific data. Although that can often be done via local variables and function arguments, sometimes it's more convenient to have a per-thread global data structure, so it's nice to have the option, which means it's something to add to our per-thread list.
Things that Unix Decided to Share
Many of the remaining things on our list are shared among all threads in a process in Unix-like operating systems, including
- File descriptors: All threads in a process share the same set of open files.
- User and Group ID: All threads in a process run with the same user and group permissions.
- Current working directory: All threads in a process share the same working directory.
- Root directory: All threads in a process share the same root directory.
- Pending signals and the signal mask: All threads in a process (mostly) share signal state.
- Time of next alarm signal: There is a single alarm timer for the whole process.
So, like, if one thread changes the current working directory, all the other threads in the process will see the change?
That's right. In practice, this sharing means that the chdir system call is off-limits in a multithreaded program, because it changes the CWD for everyone and would likely lead to chaos. In some situations (e.g., web servers), this issue can require the use of long file paths to avoid changing the working directory.
Maybe it would have been better to make the current working directory per-thread?
Possibly the whole concept of a single “current working directory” is a bit of a historical artifact, and it would have been better to have a more flexible system.
Things that Unix Decided to Make Per-Thread
Unix-like operating systems also have some per-thread attributes, including
- Priority: Each thread has its own priority.
- CPU time used: Each thread has its own CPU time usage.
- Signal mask: Each thread has its own signal mask.
Hay! You said that the signal mask was shared among all threads in a process, but now you're saying it's per-thread. Which is it?
Actually, it's a mess.
Deeper Dive: Signal Handling
Signals are used in a Unix system for two main purposes:
- Asynchronous notifications: Signals are used to notify a process that a particular event has occurred, such as a user interrupt (e.g., SIGINT), a child process terminating (e.g., SIGCHLD), or a timer expiring (e.g., SIGALRM).
- Synchronous exceptions: Signals are used to handle synchronous exceptions, such as an illegal memory access (e.g., SIGSEGV) or division by zero (e.g., SIGFPE).
These elements are, in some sense, quite different things, but the Unix developers thought that it was simpler to have a single mechanism for both—a mechanism that was designed before threads were a thing. But the waters got murkier when threads were introduced.
Asynchronous notifications (especially those about external events, such as the user pressing Ctrl-C or a child process terminating) are something the operating system just wants to tell the process as a whole. There is nothing thread-specific about them. So that requirement suggests that signal handling should be done on a per-process basis.
But synchronous exceptions are caused by an error in a specific thread (e.g., a particular thread divides by zero or makes an illegal memory access). So that reasoning suggests that signal handling should be done on a per-thread basis.
As a result, Unix has some rather strange rules about how signals are handled in multithreaded programs. First, if an asynchronous signal is delivered to a process, it is delivered to exactly one thread in the process (and which thread is unspecified!). Synchronous signals, in contrast, are delivered to the thread that caused the exception.
There are two signal masks: the process signal mask and the thread signal mask. The process signal mask is used to block signals for the entire process, whereas the thread signal mask is used to block signals for a specific thread.
Although the signal mask has a per-thread option, the signal action (what happens when a signal is delivered) is per-process. So if a signal is delivered to a thread, the same function is always called, regardless of which thread received the signal. We can't have one thread call one function if there is a division by zero, and another thread call a different one.
Overall, it mostly makes sense, but it's a bit of a mess.
Are there other tough decisions like this that operating systems had to make when adding threading support?
Yes. Let's think about another one.
I think fork should duplicate all the threads. It's supposed to be an exact duplicate of the original, and it's hardly an exact duplicate if some threads are missing.
But wait, what about the other threads? If they're in the middle of something, they might not be in a good state to be duplicated. Like, what if they're writing to a file? Then we'll get two sets of writes, which is probably not what we want.
Good point, Dog. That does seem like an issue.
Okay, so instead we should just duplicate the calling thread. That way, we don't have to worry about the other threads being in a bad state, and we don't have to worry about them being duplicated either.
But what if one of the other threads is holding a lock? If it suddenly vanishes, the lock will never be released in the forked process.
That's a problem, too.
So, what if we just tell the other threads that a fork has happened, and let them decide what to do? Maybe send them a SIGFORK signal or something?
That's an interesting idea, although the details would need to be worked out carefully.
Gah, it doesn't feel like any of these options are great.
Let's look at what POSIX actually does.
Honestly, there are no great options here. POSIX decided to duplicate only the calling thread, and to leave the other threads in the dark that the fork even happened.
Generally, it's considered a bad idea for a multithreaded program to call fork() while other threads are actively working. It's okay, however, if just after forking the child immediately calls exec() to replace its memory space with a new program. That is, after all, the most common use of fork().
This thread stuff is pretty tricky. But at least we're done. We've seen user-space threads that are lightweight compared to native threads, but can't take advantage of multicore machines. And we've seen native threads, which are heavier but can take advantage of multicore machines, even if they were a bit of a retrofit to existing operating systems.
But wait, there's more!
Hybrid Approaches
There are also hybrid approaches that try to combine the best of both worlds. We could, for example, have a user-space thread library that uses native threads behind the scenes. Options for doing this vary:
Synchronization in User Space
One place where thread libraries try to be more efficient is in synchronization. For example, if a thread is waiting for a lock, it might busy-wait for a short while before giving up and going to sleep. This approach can be more efficient than making a system call to the kernel to put the thread to sleep and wake it up again.
But this is code duplication again, right? We're writing our synchronization logic in two places: once in the user-space thread library and once in the kernel.
That's right. It makes the code more complex and harder to maintain. But it can be more efficient.
N:M Threading
Another approach is to have an N:M threading model, where N user-space threads are mapped to M native (kernel-supported) threads. This approach can be more efficient than a 1:1 threading model, because the user-space threads can be scheduled by the user-space thread library, which can be more efficient than the kernel scheduler.
For example, if we have communication channels, and one thread wants to send a message to another thread and wait for the response, we can do that without involving the kernel at all, switching between the sender and receiver entirely in user space.
This is the approach used by the Go programming language, which has a user-space thread library that maps goroutines to native threads. The environment variable GOMAXPROCS can be used to control how many native threads are used.
Fun fact: GOMAXPROCS used to default to 1, which meant that unless you knew about this environment variable, your Go programs would only ever use one core. Go 1.5 changed the default to the number of cores on the machine.
A similar approach is used by Apple's Grand Central Dispatch (GCD) library, which is used in macOS and iOS, and also behind the scenes to support Swift's concurrency model.
The Windows Thread Pool API, which is used by the .NET framework and other Windows libraries, uses a similar approach.
Takeaways
- Threads are a way to run multiple tasks concurrently within the same program.
- User-space threads are lightweight and can be more efficient, but can't take advantage of multicore machines.
- It's actually not that hard to write a simple user-space thread library, as we saw in the example code. But writing a full-featured one is quite tricky.
- Native threads are heavier because thread operations require system calls, but they can take advantage of multicore machines.
- There are hybrid approaches that try to combine the best of both worlds.
- There are many design decisions to be made when adding threading support to an existing operating system, and no perfect answers.