Dynamic Linking and Shared Libraries
In the last section, we saw that statically linking in the C library made our “Hello World” program HUGE on Linux! I want to know MORE about how we can make that better.
That's right! And while we saw some ways to make that better (like using
musl), there's another approach entirely: sharing the library code between programs.
The Problem with Static Linking
Let's imagine a world where all the programs on your computer are statically linked. What would that look like…?
$ ls -lh /bin/cat /bin/find /bin/grep /bin/ls /bin/python3 /bin/xargs
-rwxr-xr-x 1 root root 843K Apr 5 2024 /bin/cat
-rwxr-xr-x 1 root root 1.5M Apr 8 2024 /bin/find
-rwxr-xr-x 1 root root 1.7M Apr 8 2024 /bin/grep
-rwxr-xr-x 1 root root 1.3M Apr 5 2024 /bin/ls
-rwxr-xr-x 1 root root 9.4M Sep 11 07:17 /bin/python3
-rwxr-xr-x 1 root root 879K Apr 8 2024 /bin/xargs
If we run a command pipeline like
$ find . -name "*.c" | grep TODO | xargs cat
we're running four programs (find, grep, xargs, and cat). The combined code for all of them is 1.5 MB + 1.7 MB + 879 KB + 843 KB, about 4.75 MB, all for something quite simple! The bulk of that space is four separate copies of the C library code!
That's a lot of wasted space!
Are they exactly the same?
Although they have a lot in common, each one is using a different set of functions from the library. So there's a lot of common code, but each one is a little different.
A First Try: Link with Absolute Addresses
I've got an idea. We link the program as usual, but we decide ahead of time where all the libraries are going to live (maybe up high in memory somewhere).
And then we just put the addresses of the library functions right in the program code using the linker's relocation records just like we did before!
Whoa! And then we don't include the library code in the program, we just put some magic to say “map in the library at the agreed-upon address”, and then everyone can share it!
The problem with fixing the addresses of library functions in the program is that if the library changes, the addresses will change, too. So we'd have to recompile every program that uses the library every time the library changes. That's not very practical!
Thinking about it, it's not going to work well with Address Space Layout Randomization (ASLR) either. If we want to move the library around in memory (for security reasons), then we can't hardcode the addresses of the library functions in the program because they'll change every time we load the program.
A Second Try: Load-Time Relocation
We could just run the linker every time we load the program. It can use the relocation records to fix up the addresses then. Instead of outputting the linked program to a file, it could just fix up the addresses in memory.
So it would be live patching the code! Love it!
The big problem with running the linker every time we load a program is that it would slow things down. We'd have to read in the program, read in the libraries, patch the code in memory to fix up the addresses, and only then start running the program. Full-blown linking is often quite a complex process, and we'd be doing it every time we load a program.
There's also the question of sharing code between multiple instances of the same program. With our prelinked programs, the program code was read-only and could be read into memory directly from the executable file. But if we're going to be patching the code every time we load it, the code might not be so easy to share. We also can't demand-page the code out of the executable file; we'd have to swap the writeable pages to swap space if we didn't want those pages in memory.
What We Want…
Let's pause to review what we want from a shared-library system:
- Memory Efficiency: We want to share library code between programs to save memory.
- Speed: We don't want to slow down program startup (too much).
- Flexibility: We want to be able to update libraries without recompiling all programs.
- Security: We want to be able to move libraries around in memory (ASLR), rather than always having them at the same address.
- Read-Only Code: We want to share program code between processes and not patch it in memory.
I'm not sure we can have all these things at once. If we don't know ahead of time where the library is going to be, we're going to have to patch up something so we can find the library functions.
True. But the trick is to minimize the amount of patching we have to do…
A Better Way: Indirect Calls
Instead of modifying our code to call library functions directly, what if we
- Made a table of function addresses.
- Have the code call through the table.
- Only fix up the table entries when we load the program.
That table is called the Global Offset Table (GOT). To use this approach, our compiler needs to transform our hello program to use the GOT. Here's pseudocode for what the compiler would turn our hello program into:
#include <unistd.h>

struct got_entry {
    void *address;
    char *symbol_name;
};

__attribute__((section(".data.got")))
struct got_entry got[] = {
    { 0, "write" },
    { 0, "printf" },
    { 0, "exit" },
    { 0, NULL }
};

typedef ssize_t (*write_func)(int, const void *, size_t);
typedef int (*printf_func)(const char *, ...);
typedef void (*exit_func)(int);

int
main()
{
    ((write_func)got[0].address)(1, "Hello World!\n", 13);
    ((printf_func)got[1].address)("... and hello to dynamic linking!\n");
    ((exit_func)got[2].address)(0);
    return 0;
}

/* Below is some magic code to actually fill in the GOT. You don't need
 * to worry too much about it, only that it's run when the program is
 * loaded. We'll imagine that the dynamic linker does this for us. */
#include <dlfcn.h>

#ifndef RTLD_DEFAULT
#define RTLD_DEFAULT 0
#endif

// Magic constructor function to fill in the GOT, using dlsym and GCC magic attributes
void __attribute__((constructor))
initialize_got()
{
    for (int i = 0; got[i].symbol_name != NULL; i++) {
        got[i].address = dlsym(RTLD_DEFAULT, got[i].symbol_name);
    }
}
Thus, main compiles to
main:
        endbr64
        subq    $8, %rsp
        movl    $13, %edx
        movl    $.LC0, %esi
        movl    $1, %edi
        call    *got(%rip)          # Call write
        movl    $.LC1, %edi
        xorl    %eax, %eax
        call    *got+16(%rip)       # Call printf
        xorl    %edi, %edi
        call    *got+32(%rip)       # Call exit
        xorl    %eax, %eax
        addq    $8, %rsp
        ret
So long as we know where got is (which we do, since it's inside the executable), we can find the addresses of the library functions. And because the global offset table is just data, we can write to it when we load the program.
I see! The code doesn't change, just the table entries.
Right!
I hope we don't have to write this horrible, ugly code… Right?
No. If the operating system adopts this approach, we'd make sure the compiler did the dirty work for us so we could write code the same way we always have.
And so:
- The code can be read-only and shared.
- We only need to write to data pages (the GOT).
- Each process has its own GOT with the right addresses.
- Linking is relatively fast (just update the GOT by looking up all the symbols we need).
So we've got memory efficiency, speed, flexibility, and security! I like it!
Meh. You've still got to look everything up before you can run the program. That's going to slow things down.
Okay, let's fix that, too.
Lazy Binding: Why Fix Everything at Once?
There's another way we could do things, at least for functions. Instead of a table of data, we could have a table of code (called function stubs). After the dynamic linker has run, it's as if we'd written
.text.plt:
write_jump:
        jmp     write
printf_jump:
        jmp     printf
exit_jump:
        jmp     exit
But initially, the table would look like
.text.plt:
write_jump:
        jmp     0
printf_jump:
        jmp     0
exit_jump:
        jmp     0
We'd expect the linker to patch up the table entries when we load the program.
I see… So the program would call printf_jump, which would jump to printf (once the linker has fixed it up).
Hey! Didn't you want to avoid patching code?
Well, this is just a tiny bit of code in a special section that we're patching. It's not like we're patching the whole program.
Okay, but this method doesn't seem like it's much better than the GOT. We still have to look up all the addresses before we can run the program.
Ah, but do we…?
The Trick
What we actually make the table entries look like initially is
.text.plt:
write_jump:
        call    dyld_fixup
printf_jump:
        call    dyld_fixup
exit_jump:
        call    dyld_fixup

.rodata.symnames:
S1:     .asciz  "write"
S2:     .asciz  "printf"
S3:     .asciz  "exit"

.rodata.symlist:
        .quad   S1
        .quad   S2
        .quad   S3
        .quad   0
The trick is that dyld_fixup looks up the symbol name in the symbol list, finds the address of the function, and then patches the code, replacing the call to dyld_fixup with a jump to the actual function in the library. With this approach, we don't have to look up the address of the function until we actually need it.
Wait—how did it know what the right function was?
Because we used a call instruction, we pushed the return address onto the stack. dyld_fixup can look at the return address to figure out where the call instruction that needs fixing is located. It can then use how far that instruction is from the start of the PLT section to figure out which function it's calling and look in the .rodata.symlist to find the name of the function it needs to look up.
PLT?
This approach is either called lazy binding (on macOS) or the Procedure Linkage Table (PLT) (on Linux). It's a way to avoid looking up the addresses of library functions until we actually need them.
And is that what Linux and macOS do?
Pretty much. Let's take a look!
In Practice…
Let's compile this program and see what it looks like!
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int
main()
{
    write(1, "Hello World!\n", 13);
    printf("... and hello to dynamic linking!\n");
    exit(0);
}
The code for main looks like
main:
        endbr64
        subq    $8, %rsp
        movl    $13, %edx
        leaq    .LC0(%rip), %rax
        movq    %rax, %rsi
        movl    $1, %edi
        call    write@PLT
        leaq    .LC1(%rip), %rax
        movq    %rax, %rdi
        movl    $0, %eax
        call    printf@PLT
        movl    $0, %edi
        call    exit@PLT
So you can see here that it's calling little stub functions like we suggested (except that they're called write@PLT, printf@PLT, and exit@PLT rather than write_jump, printf_jump, and exit_jump), and then those functions will call the real functions, figuring out where the functions really are on the first call.
These days both macOS and Linux don't actually patch the stub code (even though there's not much to patch); instead they have a hybrid approach that mixes looking things up in the GOT with calling a special helper function to load the right value into the GOT.
Right, but we don't need to delve that deeply into those details.
Unless you want to…
The Real PLT Sections
As we saw above, the code in main is very similar to what we had in our design—we call the write@PLT, printf@PLT, and exit@PLT functions. But what do those functions actually look like?
$ objdump -j .plt.sec -D -r hello-dyn2
Disassembly of section .plt.sec:
0000000000001070 <write@plt>:
1070: f3 0f 1e fa endbr64
1074: ff 25 46 2f 00 00 jmp *0x2f46(%rip) # 3fc0 <write@GLIBC_2.2.5>
107a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000001080 <printf@plt>:
1080: f3 0f 1e fa endbr64
1084: ff 25 3e 2f 00 00 jmp *0x2f3e(%rip) # 3fc8 <printf@GLIBC_2.2.5>
108a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000001090 <exit@plt>:
1090: f3 0f 1e fa endbr64
1094: ff 25 36 2f 00 00 jmp *0x2f36(%rip) # 3fd0 <exit@GLIBC_2.2.5>
109a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
We can see that they are jump instructions, but they're indirect jumps (indicated by the *) to entries in the Global Offset Table (GOT).
So, first off, the Linux approach is compatible with our original non-lazy scheme where we'd fill in the GOT completely when the program loads. In fact, that's the approach Linux used to use, but it now uses a hybrid approach where the GOT entries aren't all filled in at once, but are instead filled in as needed with the help of another part of the PLT system that contains little helper functions:
$ objdump -j .plt -D -r hello-dyn2
Disassembly of section .plt:
0000000000001020 <.plt>:
1020: ff 35 8a 2f 00 00 push 0x2f8a(%rip) # 3fb0 <_GLOBAL_OFFSET_TABLE_+0x8>
1026: ff 25 8c 2f 00 00 jmp *0x2f8c(%rip) # 3fb8 <_GLOBAL_OFFSET_TABLE_+0x10>
102c: 0f 1f 40 00 nopl 0x0(%rax)
write@plt_helper:
1030: f3 0f 1e fa endbr64
1034: 68 00 00 00 00 push $0x0
1039: e9 e2 ff ff ff jmp 1020 <_init+0x20>
103e: 66 90 xchg %ax,%ax
printf@plt_helper:
1040: f3 0f 1e fa endbr64
1044: 68 01 00 00 00 push $0x1
1049: e9 d2 ff ff ff jmp 1020 <_init+0x20>
104e: 66 90 xchg %ax,%ax
exit@plt_helper:
1050: f3 0f 1e fa endbr64
1054: 68 02 00 00 00 push $0x2
1059: e9 c2 ff ff ff jmp 1020 <_init+0x20>
105e: 66 90 xchg %ax,%ax
Each of these little helpers pushes a number corresponding to the desired function (0 = write, 1 = printf, 2 = exit) onto the stack and jumps to a mystery routine at 1020. This routine is analogous to the dyld_fixup function we described earlier. It figures out which function we want, looks up its name in the dynamic symbol table, and then updates the GOT entry to point to the real function.
So once each helper fixes up its associated GOT entry, it'll never be called again?
That's right.
It's like a dance of indirection! Or a game of peek-a-boo! Now you see me, now you don't!
All this indirection is making my head spin!
Meh. I don't think we're going to ever need to know this stuff in this much detail.
Back to Memory Usage
Phew! That felt like a lot of complexity. Does it actually help?
Let's see…
Remember the sizes of those executables from before? Let's look at the savings; or, to put it another way, how much worse static linking is compared to dynamic linking:
| Program | Dynamic Size (KB) | Static Size (KB) | Blowup (×) |
|---|---|---|---|
| cat | 38.5 | 842.6 | 22.9 |
| find | 199.5 | 1476.8 | 7.4 |
| grep | 182.4 | 1664.3 | 9.1 |
| ls | 139.0 | 1319.5 | 9.5 |
| python3 | 7831.2 | 9416.7 | 1.2 |
| xargs | 62.4 | 879.1 | 14.1 |
You can see that, especially for small programs, the dynamic-linking approach saves a lot of space. And even for larger programs, the savings are significant.
So now, if we run a command like
$ find . -name "*.c" | grep TODO | xargs cat
it's only going to require 38.5 KB + 182.4 KB + 62.4 KB + 139.0 KB = 422.3 KB, rather than 4.75 MB! More than a tenfold reduction in memory usage!
Nice! We get to have our cake and eat it too!
Meh. All this indirection through the global-offset table must slow things down.
The overhead is pretty small, especially with modern CPU branch prediction. And the memory savings are huge!