CS 134

Dynamic Linking and Shared Libraries

  • Pig speaking

    In the last section, we saw that statically linking in the C library made our “Hello World” program HUGE on Linux! I want to know MORE about how we can make that better.

  • PinkRobot speaking

    That's right! And while we saw some ways to make that better (like using musl), there's another approach entirely: sharing the library code between programs.

The Problem with Static Linking

Let's imagine a world where all the programs on your computer are statically linked. What would that look like…?

$ ls -lh /bin/cat /bin/find /bin/grep /bin/ls /bin/python3 /bin/xargs
-rwxr-xr-x 1 root root 843K Apr  5  2024 /bin/cat
-rwxr-xr-x 1 root root 1.5M Apr  8  2024 /bin/find
-rwxr-xr-x 1 root root 1.7M Apr  8  2024 /bin/grep
-rwxr-xr-x 1 root root 1.3M Apr  5  2024 /bin/ls
-rwxr-xr-x 1 root root 9.4M Sep 11 07:17 /bin/python3
-rwxr-xr-x 1 root root 879K Apr  8  2024 /bin/xargs

If we run a command pipeline like

$ find . -name "*.c" | grep TODO | xargs cat

we're running four programs (find, grep, xargs, and cat). The combined code for all of them is 1.5 MB + 1.7 MB + 879 KB + 843 KB = 4.75 MB, all for something quite simple! The bulk of that space is four separate copies of the C library code!

  • Duck speaking

    That's a lot of wasted space!

  • Hedgehog speaking

    Are they exactly the same?

  • PinkRobot speaking

    Although they have a lot in common, each one is using a different set of functions from the library. So there's a lot of common code, but each one is a little different.

  • Duck speaking

    I've got an idea. We link the program as usual, but we decide ahead of time where all the libraries are going to live (maybe up high in memory somewhere).

  • Dog speaking

    And then we just put the addresses of the library functions right in the program code using the linker's relocation records just like we did before!

  • Horse speaking

    Whoa! And then we don't include the library code in the program, we just put some magic to say “map in the library at the agreed-upon address”, and then everyone can share it!

Do you see any problems with this approach? (Hint: What about when a new version of the library comes out?)

The problem with fixing the addresses of library functions in the program is that if the library changes, the addresses will change, too. So we'd have to recompile every program that uses the library every time the library changes. That's not very practical!

  • Cat speaking

    Thinking about it, it's not going to work well with Address Space Layout Randomization (ASLR) either. If we want to move the library around in memory (for security reasons), then we can't hardcode the addresses of the library functions in the program because they'll change every time we load the program.

A Second Try: Load-Time Relocation

  • Duck speaking

    We could just run the linker every time we load the program. It can use the relocation records to fix up the addresses then. Instead of outputting the linked program to a file, it could just fix up the addresses in memory.

  • Dog speaking

    So it would be live patching the code! Love it!

Is this a good idea? (Hint: There are issues related to time and issues related to space.)

The big problem with running the linker every time we load a program is that it would slow things down. We'd have to read in the program, read in the libraries, patch the code in memory to fix up the addresses, and only then start running the program. Full-blown linking is often quite a complex process, and we'd be doing it every time we load a program.

There's also the question of sharing code between multiple instances of the same program. With our prelinked programs, the program code was read-only and could be read into memory directly from the executable file. But if we're going to be patching the code every time we load it, the code might not be so easy to share. We also can't demand-page the code out of the executable file; we'd have to swap the writeable pages to swap space if we didn't want those pages in memory.

What We Want…

Let's pause to review what we want from a shared-library system:

  • Memory Efficiency: We want to share library code between programs to save memory.
  • Speed: We don't want to slow down program startup (too much).
  • Flexibility: We want to be able to update libraries without recompiling all programs.
  • Security: We want to be able to move libraries around in memory (ASLR), rather than always having them at the same address.
  • Read-Only Code: We want to share program code between processes and not patch it in memory.

  • Cat speaking

    I'm not sure we can have all these things at once. If we don't know ahead of time where the library is going to be, we're going to have to patch up something so we can find the library functions.

  • PinkRobot speaking

    True. But the trick is to minimize the amount of patching we have to do…

A Better Way: Indirect Calls

Instead of modifying our code to call library functions directly, what if we

  1. Made a table of function addresses.
  2. Have the code call through the table.
  3. Only fix up the table entries when we load the program.

That table is called the Global Offset Table (GOT). To use this approach, our compiler needs to transform our hello program to use the GOT. Here's pseudocode for what the compiler would turn our hello program into:

#include <unistd.h>

struct got_entry {
        void *address;
        char *symbol_name;
};

__attribute__((section(".data.got")))
struct got_entry got[] = {
        { 0, "write" },
        { 0, "printf" },
        { 0, "exit" },
        { 0, NULL }
};

typedef ssize_t (*write_func)(int, const void *, size_t);
typedef int (*printf_func)(const char *, ...);
typedef void (*exit_func)(int);

int 
main()
{
        ((write_func)got[0].address)(1, "Hello World!\n", 13);
        ((printf_func)got[1].address)("... and hello to dynamic linking!\n");
        ((exit_func)got[2].address)(0);
        return 0;
}


/* Below is some magic code to actually fill in the GOT. You don't need
 * to worry too much about it, only that it's run when the program is
 * loaded.  We'll imagine that the dynamic linker does this for us. */

#include <dlfcn.h>

#ifndef RTLD_DEFAULT
    #define RTLD_DEFAULT 0
#endif

// Magic constructor function to fill in the GOT, using dlsym and GCC magic attributes
void __attribute__((constructor)) 
initialize_got()
{
        for (int i = 0; got[i].symbol_name != NULL; i++) {
                got[i].address = dlsym(RTLD_DEFAULT, got[i].symbol_name);
        }
}

Thus, main compiles to

main:
        endbr64
        subq    $8, %rsp
        movl    $13, %edx
        movl    $.LC0, %esi
        movl    $1, %edi
        call    *got(%rip)      # Call write
        movl    $.LC1, %edi
        xorl    %eax, %eax
        call    *got+16(%rip)   # Call printf
        xorl    %edi, %edi
        call    *got+32(%rip)   # Call exit
        xorl    %eax, %eax
        addq    $8, %rsp
        ret

So long as we know where got is (which we do, since it's inside the executable), we can find the addresses of the library functions. And because the global offset table is just data, we can write to it when we load the program.

  • Cat speaking

    I see! The code doesn't change, just the table entries.

  • PinkRobot speaking

    Right!

  • Hedgehog speaking

    I hope we don't have to write this horrible, ugly code… Right?

  • PinkRobot speaking

    No. If the operating system adopts this approach, we'd make sure the compiler did the dirty work for us so we could write code the same way we always have.

And so:

  • The code can be read-only and shared.
  • We only need to write to data pages (the GOT).
  • Each process has its own GOT with the right addresses.
  • Linking is relatively fast (just update the GOT by looking up all the symbols we need).

  • Dog speaking

    So we've got memory efficiency, speed, flexibility, and security! I like it!

  • Goat speaking

    Meh. You've still got to look everything up before you can run the program. That's going to slow things down.

  • PinkRobot speaking

    Okay, let's fix that, too.

Lazy Binding: Why Fix Everything at Once?

There's another way we could do things, at least for functions. Instead of a table of data, we could have a table of code (called function stubs). After the dynamic linker has run, it's as if we'd written

.text.plt:
write_jump:
        jmp     write
printf_jump:
        jmp     printf
exit_jump:
        jmp     exit

But initially, the table would look like

.text.plt:
write_jump:
        jmp     0
printf_jump:
        jmp     0
exit_jump:
        jmp     0

We'd expect the linker to patch up the table entries when we load the program.

  • Cat speaking

    I see… So the program would call printf_jump, which would jump to printf (once the linker has fixed it up).

  • Horse speaking

    Hay! Didn't you want to avoid patching code?

  • PinkRobot speaking

    Well, this is just a tiny bit of code in a special section that we're patching. It's not like we're patching the whole program.

  • Duck speaking

    Okay, but this method doesn't seem like it's much better than the GOT. We still have to look up all the addresses before we can run the program.

  • PinkRobot speaking

    Ah, but do we…?

The Trick

What we actually make the table entries look like initially is

.text.plt:
write_jump:
        call    dyld_fixup
printf_jump:
        call    dyld_fixup
exit_jump:
        call    dyld_fixup

.rodata.symnames:
S1:     .asciz  "write"
S2:     .asciz  "printf"
S3:     .asciz  "exit"

.rodata.symlist:
        .quad   S1
        .quad   S2
        .quad   S3
        .quad   0

Can you see the trick? What does dyld_fixup do?

The trick is that dyld_fixup looks up the symbol name in the symbol list, finds the address of the function, and then patches the code, replacing the call to dyld_fixup with a jump to the actual function in the library. With this approach, we don't have to look up the address of the function until we actually need it.

  • Duck speaking

    Wait—how did it know what the right function was?

  • PinkRobot speaking

    Because we used a call instruction, we pushed the return address onto the stack. dyld_fixup can look at the return address to figure out where the call instruction that needs fixing is located. It can then use how far that instruction is from the start of the PLT section to figure out which function it's calling and look in .rodata.symlist to find the name of the function it needs to look up.

  • Hedgehog speaking

    PLT?

This approach is called lazy binding on macOS; on Linux, the table of stubs is known as the Procedure Linkage Table (PLT). Either way, it's a way to avoid looking up the addresses of library functions until we actually need them.

  • Duck speaking

    And is that what Linux and macOS do?

  • PinkRobot speaking

    Pretty much. Let's take a look!

In Practice…

Let's compile this program and see what it looks like!

#include <unistd.h>
#include <stdio.h>

int 
main()
{
        write(1, "Hello World!\n", 13);
        printf("... and hello to dynamic linking!\n");
        exit(0);
}

The code for main looks like

main:
        endbr64
        subq    $8, %rsp
        movl    $13, %edx
        leaq    .LC0(%rip), %rax
        movq    %rax, %rsi
        movl    $1, %edi
        call    write@PLT
        leaq    .LC1(%rip), %rax
        movq    %rax, %rdi
        movl    $0, %eax
        call    printf@PLT
        movl    $0, %edi
        call    exit@PLT

So you can see here that it's calling little stub functions like we suggested (except that they're called write@PLT, printf@PLT, and exit@PLT rather than write_jump, printf_jump, and exit_jump), and then those functions will call the real functions, figuring out where the functions really are on the first call.

  • Rabbit speaking

    These days, neither macOS nor Linux actually patches the stub code (even though there's not much to patch); instead, they use a hybrid approach that mixes looking things up in the GOT with calling a special helper function to load the right value into the GOT.

  • PinkRobot speaking

    Right, but we don't need to delve that deeply into those details.

  • Rabbit speaking

    Unless you want to…

The Real PLT Sections

As we saw above, the code in main is very similar to what we had in our design—we call the write@PLT, printf@PLT, and exit@PLT functions. But what do those functions actually look like?

$ objdump -j .plt.sec -D -r  hello-dyn2
Disassembly of section .plt.sec:

0000000000001070 <write@plt>:
    1070:       f3 0f 1e fa             endbr64
    1074:       ff 25 46 2f 00 00       jmp    *0x2f46(%rip)        # 3fc0 <write@GLIBC_2.2.5>
    107a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000001080 <printf@plt>:
    1080:       f3 0f 1e fa             endbr64
    1084:       ff 25 3e 2f 00 00       jmp    *0x2f3e(%rip)        # 3fc8 <printf@GLIBC_2.2.5>
    108a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000001090 <exit@plt>:
    1090:       f3 0f 1e fa             endbr64
    1094:       ff 25 36 2f 00 00       jmp    *0x2f36(%rip)        # 3fd0 <exit@GLIBC_2.2.5>
    109a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

We can see that they are jump instructions, but they're indirect jumps (indicated by the *) to entries in the Global Offset Table (GOT).

So, first off, the Linux approach is compatible with our original non-lazy scheme where we'd fill in the GOT completely when the program loads. In fact, that's the approach Linux used to use, but it now uses a hybrid approach where the GOT entries aren't all filled in at once, but are instead filled in as needed with the help of another part of the PLT system that contains little helper functions:

$ objdump -j .plt -D -r  hello-dyn2

Disassembly of section .plt:

0000000000001020 <.plt>:
    1020:   ff 35 8a 2f 00 00       push   0x2f8a(%rip)        # 3fb0 <_GLOBAL_OFFSET_TABLE_+0x8>
    1026:   ff 25 8c 2f 00 00       jmp    *0x2f8c(%rip)        # 3fb8 <_GLOBAL_OFFSET_TABLE_+0x10>
    102c:   0f 1f 40 00             nopl   0x0(%rax)
  write@plt_helper:
    1030:   f3 0f 1e fa             endbr64
    1034:   68 00 00 00 00          push   $0x0
    1039:   e9 e2 ff ff ff          jmp    1020 <_init+0x20>
    103e:   66 90                   xchg   %ax,%ax
  printf@plt_helper:
    1040:   f3 0f 1e fa             endbr64
    1044:   68 01 00 00 00          push   $0x1
    1049:   e9 d2 ff ff ff          jmp    1020 <_init+0x20>
    104e:   66 90                   xchg   %ax,%ax
  exit@plt_helper:
    1050:   f3 0f 1e fa             endbr64
    1054:   68 02 00 00 00          push   $0x2
    1059:   e9 c2 ff ff ff          jmp    1020 <_init+0x20>
    105e:   66 90                   xchg   %ax,%ax

Each of these little helpers pushes a number corresponding to the desired function (0 = write, 1 = printf, 2 = exit) onto the stack and jumps to a mystery routine at 1020. This routine is analogous to the dyld_fixup function we described earlier: it figures out which function we want, looks up its name in the dynamic symbol table, and then updates the GOT entry to point to the real function.

  • Cat speaking

    So once each helper fixes up its associated GOT entry, it'll never be called again?

  • PinkRobot speaking

    That's right.

  • Dog speaking

    It's like a dance of indirection! Or a game of peek-a-boo! Now you see me, now you don't!

  • Hedgehog speaking

    All this indirection is making my head spin!

  • Goat speaking

    Meh. I don't think we're going to ever need to know this stuff in this much detail.


Back to Memory Usage

  • Hedgehog speaking

    Phew! That felt like a lot of complexity. Does it actually help?

  • PinkRobot speaking

    Let's see…

Remember the sizes of those executables from before? Let's look at the savings; or, to put it another way, how much worse static linking is compared to dynamic linking:

Program   Dynamic Size (KB)   Static Size (KB)   Blowup (×)
cat                    38.5              842.6         21.9
find                  199.5             1476.8          7.4
grep                  182.4             1664.3          9.1
ls                    139.0             1319.5          9.5
python3              7831.2             9416.7          1.2
xargs                  62.4              879.1         14.1

You can see that, especially for small programs, the dynamic-linking approach saves a lot of space. And even for larger programs, the savings are significant.

So now, if we run a command like

$ find . -name "*.c" | grep TODO | xargs cat

it's only going to require 38.5 KB + 182.4 KB + 62.4 KB + 139.0 KB = 422.3 KB, rather than 4.75 MB. That's more than a tenfold reduction in memory usage!

  • Dog speaking

    Nice! We get to have our cake and eat it too!

  • Goat speaking

    Meh. All this indirection through the global offset table must slow things down.

  • PinkRobot speaking

    The overhead is pretty small, especially with modern CPU branch prediction. And the memory savings are huge!
