Paging
The Story So Far
We've moved from arranging memory using segments,
(Diagram: Neovim with segmentation)
to arranging memory using pages.
(Diagram: Neovim with paging)
Terminology
- Page: A fixed-size chunk of memory that holds code or data.
- 4 KB is a common size for pages, but it isn't the only option. For example, macOS on ARM CPUs uses 16 KB pages.
- Page Table: A data structure that maps logical addresses to physical addresses.
- Page Table Entry: A single entry in the page table that maps a single page.
- Frame: A fixed-size chunk of physical memory that holds a page. (Sometimes called a page frame.)
Mapping Logical Addresses to Physical Addresses
The page table maps each page to a page frame. In this design, we're imagining the table is external to the CPU because it's too big to fit on the chip.
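In code, the translation is just an index-and-offset calculation. Here's a minimal sketch in Python, assuming 4 KB pages; the page-table contents are invented for illustration:

```python
PAGE_SIZE = 4096  # 4 KB pages

# A hypothetical page table: key = page number, value = frame number.
# Pages absent from the table are unmapped.
page_table = {0: 3, 1: 7, 2: 5}

def translate(logical_addr):
    """Map a logical address to a physical address via the page table."""
    page = logical_addr // PAGE_SIZE      # which page?
    offset = logical_addr % PAGE_SIZE     # where within the page?
    if page not in page_table:
        raise MemoryError("page fault: address not mapped")
    frame = page_table[page]
    return frame * PAGE_SIZE + offset     # same offset, different frame

print(hex(translate(0x1234)))  # page 1 -> frame 7, so -> 0x7234
```

The offset passes through unchanged; only the page number is replaced by a frame number.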
I almost get it, but why are there blue pages in the frames? I don't see any blue pages in the logical address space.
The logical address space is per-process, and each process has its own set of pages. The blue pages are from a different process.
So the page table is different for each process?
Yes, because each process has its own logical address space.
I guess the green is the code and the orange-brown is the stack, but why is there a huge gap of white nothingness in the middle?
Those are parts of the logical address space that don't have a page mapped to them. They're not valid addresses. But they provide room for the stack to grow downward and the heap to grow upward.
Hay! Don't you have that backwards?
That's the weird thing about how we represent memory when we draw it on a page. Usually the far end of memory (the “top”) is at the bottom of the page.
Fun with Logical to Physical Address Translation: Aliasing
Given our arrangement of memory, it's technically possible to have multiple pages in our logical address space map to the same frame of physical memory: page aliasing. It isn't especially useful on its own, but it is possible. If you click the link below, it will take you to a page in OnlineGDB where you can see page aliasing in action. Running the program produces output like this:
Created two mappings of the same page:
addr1: 0x78f9f10db000
addr2: 0x78f9f10a1000
Writing first string to addr1...
At address 0x78f9f10db000: "Hello from first mapping!"
At address 0x78f9f10a1000: "Hello from first mapping!"
Writing second string to addr2...
At address 0x78f9f10db000: "Greetings from second mapping!"
At address 0x78f9f10a1000: "Greetings from second mapping!"
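In case the link goes stale, here's a sketch of the same idea in Python: mapping one file-backed buffer twice with mmap gives two different logical addresses backed by the same physical frames. (The C program on OnlineGDB presumably does the equivalent with mmap and MAP_SHARED; this is just an illustration of the effect.)

```python
import mmap
import tempfile

# Create a 4 KB shared buffer and map it at two different logical addresses.
backing = tempfile.TemporaryFile()
backing.truncate(4096)
m1 = mmap.mmap(backing.fileno(), 4096)
m2 = mmap.mmap(backing.fileno(), 4096)

# A write through one mapping is visible through the other, because both
# logical pages are backed by the same physical memory.
m1[:25] = b"Hello from first mapping!"
print(bytes(m2[:25]))  # -> b'Hello from first mapping!'
```

Writing through either mapping changes what both see, exactly as in the output above.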
Improving Efficiency
Because the page table is large, we placed it off-chip in memory. So every memory access actually requires two memory accesses: one to get the page-table entry and another to get the data. Which is slow.
One approach to making address translation more efficient is to add a cache that remembers a small number of recently used page-table entries. In a sane world, this feature would be called the address translation cache, but in the world of computer science, it's called the translation lookaside buffer (TLB), in part because it was invented before people started using the term “cache”.
The diagram below shows how the TLB fits into the picture. We're looking for page 4, which happens to be in frame 3. We could find that by looking in the page table, but since it's in the TLB, we can skip that step.
So the TLB is like a cache for the page table?
Yes, exactly. It's a cache for the most recently used page-table entries.
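Although a real TLB is hardware, its behavior is easy to model in software. This hypothetical sketch caches recent page-to-frame translations and evicts the least recently used entry when full:

```python
from collections import OrderedDict

class TLB:
    """A toy model of a translation lookaside buffer: a small cache of
    recently used page -> frame translations."""
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = OrderedDict()   # page number -> frame number
        self.hits = self.misses = 0

    def lookup(self, page, page_table):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)     # mark as recently used
            return self.entries[page]
        self.misses += 1
        frame = page_table[page]               # slow path: walk the page table
        self.entries[page] = frame
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return frame

tlb = TLB(capacity=16)
page_table = {4: 3}          # page 4 is in frame 3, as in the diagram
tlb.lookup(4, page_table)    # miss: must walk the page table
tlb.lookup(4, page_table)    # hit: translation comes straight from the TLB
print(tlb.hits, tlb.misses)  # -> 1 1
```

The second lookup never touches the page table at all, which is the whole point: most translations become hits, so most memory accesses need only one trip to memory.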
But the page table still has lots of empty entries, which seems wasteful. And this diagram only shows 16 entries, but on a real machine with 32-bit addressing, we said the range of logical addresses amounts to about a million pages.
Let's sort that out.
Saving Memory with Multi-Level Page Tables
First, let's take a look at a slightly bigger page table.
Hay! Why haven't we put any pages at the very beginning of memory?
If we leave address zero unmapped, it can catch null-pointer dereferences, as the null pointer is usually represented as address zero. If we put anything in memory there, we'd be able to actually read something (even though it's wrong), so the program wouldn't crash. But we want programs to crash if they dereference a null pointer. So an empty slot at address zero is sometimes called a guard page.
So we're using the page table to help us catch bugs? So cool!
We can make two (other) observations about this page table:
- It's sparse. Many of the entries are empty.
- Pages are allocated in contiguous regions, so the page table tends to have long runs of unused entries.
Which suggests a way we could shorten our page table—and use less memory!
Notice that the page table has a blue line every eight entries, dividing the table into blocks of eight pages.
If we look at each block of pages, we can see that some blocks have pages in use (darker colors), and some do not. Since we only care about the memory that's in use, we can drop the empty blocks, which leaves us with three separate eight-slot page tables with pointers to physical memory frames.
Now we add a new eight-slot upper-level page table whose in-use slots point to the appropriate subtable, giving us a hierarchical or multi-level page table.
(Notice that the two lower-level page tables with purple blocks are contiguous in the upper-level page table.)
We're totally depending on having long runs of empty entries in the page table to make this work. If we don't have that, we're sunk. The worst case would be if we always had exactly seven pages between used pages.
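The memory savings come from allocating lower-level tables only when some page in their block is actually used. A small Python sketch (using eight-entry tables, like the diagrams) makes the lazy allocation explicit:

```python
TABLE_SIZE = 8   # entries per table, matching the 8-slot tables in the diagrams

class TwoLevelPageTable:
    """An upper-level table whose slots point at lower-level tables,
    allocated only when a page in that block is actually mapped."""
    def __init__(self):
        self.top = [None] * TABLE_SIZE   # slots for lower-level tables

    def map(self, page, frame):
        upper, lower = divmod(page, TABLE_SIZE)
        if self.top[upper] is None:              # allocate subtable on demand
            self.top[upper] = [None] * TABLE_SIZE
        self.top[upper][lower] = frame

    def lookup(self, page):
        upper, lower = divmod(page, TABLE_SIZE)
        sub = self.top[upper]
        if sub is None or sub[lower] is None:
            raise MemoryError("page fault")
        return sub[lower]

pt = TwoLevelPageTable()
pt.map(3, 12)                              # map page 3 to frame 12
print(sum(t is not None for t in pt.top))  # -> 1: only one subtable exists
```

Blocks of eight pages that are entirely unused cost nothing beyond one empty slot in the upper-level table.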
But let's look at the positives. In our example, there is a nice big gap, and the top-level table still has a run of empty entries. Could we go even further?
Yes. Remember that what we're looking at in our diagrams is much scaled down from what you'd see in a real system.
A Real Process Memory Map
Let's look at a real-world process. Here's a memory map for Neovim, a popular text editor, running on a Raspberry Pi Zero (a 32-bit system). The code and data are at the bottom of memory (top of the diagram), followed by the heap and then a huge gap, and then the stack. Afterwards (in blue), there are a number of shared libraries that Neovim uses. The top quarter of the address space is reserved for the kernel.
(Because sizes vary dramatically, the diagram is not drawn to scale—in fact, the height of each block is based on the cube root of the size of the block. If you hover over any block, it'll tell you more details about it.)
So the kernel memory is part of the page table?
Not while user-space code is running. The kernel has its own page table, and it's mapped into the top quarter of the address space when the kernel is running. When user-space code is running, the kernel's memory is not accessible.
So user programs are restricted to at most 3 GB of memory on a 32-bit system?
On Linux, with the standard kernel configuration, yes. This arrangement makes copyin and copyout operations much simpler, as the kernel can just map the user-space memory into its own address space and access it directly. If we're willing to make copyin and copyout more complex, we could give user-space programs access to the full 4 GB of address space. FWIW, this latter approach was used by Mac OS X on 32-bit systems.
Realistic Multi-Level Page Tables (32-bit Addressing)
On a 32-bit system, we can use the two-level approach for page tables. We have 32 bits of address with 4 KB pages, so we'll break things down as follows:
- The low 12 bits will be the offset within the page.
- The remaining 20 bits are the page number, but we break those 20 bits into two 10-bit chunks:
- The top 10 bits will be the index into the top-level table.
- The next 10 bits will be the index into the second-level table.
Thus, for example, the address 0x016980d0 (representing an address inside Neovim's heap) is encoded in binary as 0000 0001 0110 1001 1000 0000 1101 0000; thus,
- The top 10 bits, representing the upper-level page number, are 0000 0001 01, or 0x005;
- The next 10 bits, representing the lower-level page number, are 10 1001 1000, or 0x298;
- The low 12 bits, the offset within the page, are 0000 1101 0000, or 0x0d0.
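In code, the three fields fall out of a few shifts and masks. This Python sketch checks the breakdown of the example heap address:

```python
addr = 0x016980d0                  # the example address in Neovim's heap

offset = addr & 0xFFF              # low 12 bits: offset within the page
lower  = (addr >> 12) & 0x3FF      # next 10 bits: second-level table index
upper  = (addr >> 22) & 0x3FF      # top 10 bits: top-level table index

print(hex(upper), hex(lower), hex(offset))  # -> 0x5 0x298 0xd0
```

The hardware does exactly this, just with wires instead of shift instructions.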
How big is one of these tables?
Each table has 1024 entries (2¹⁰). If we can fit all the data we need into four bytes per entry, then each table will be 4 KB, which is also the size of a page, so very convenient. We can imagine each entry in the second-level table has
- 24 bits for the frame number, allowing 2²⁴ frames (since a frame is 4 KB, that's 64 GB of physical memory).
- 8 bits for flags, like whether the page is present or read-only.
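We can sketch that imagined entry layout in Python; the particular flag bit assignments here are invented for illustration:

```python
# Hypothetical flag bits for our imagined 32-bit page-table entry.
PRESENT   = 1 << 0
READ_ONLY = 1 << 1

def make_entry(frame, flags):
    """Pack a 24-bit frame number and 8 flag bits into a 32-bit entry."""
    assert frame < (1 << 24)
    return (frame << 8) | flags

def entry_frame(entry):
    return entry >> 8        # high 24 bits: frame number

def entry_flags(entry):
    return entry & 0xFF      # low 8 bits: flags

e = make_entry(0x0123AB, PRESENT | READ_ONLY)
print(hex(e))  # -> 0x123ab03
```

Real page-table-entry formats differ in detail (for instance, x86 keeps the flags in the low bits, as here, but defines many more of them), but the pack-and-mask idea is the same.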
Meh, only 64 GB of memory? Weak.
64 GB of memory on a 32-bit system is actually pretty good. The Raspberry Pi Zero only has 512 MB of memory.
Realistic Multi-Level Page Tables (64-bit Addressing)
The architecture of an x86_64 system supports 64-bit addresses, but current implementations typically use 48-bit virtual addresses (the high 16 bits must be sign-extended from bit 47). With 4 KB pages (like our 32-bit example), the low 12 bits are still the page offset, which leaves us with 36 bits for the page number, which we break into four 9-bit chunks. These chunks serve as indices into a four-level page table hierarchy, often called the PGD (Page Global Directory), PUD (Page Upper Directory), PMD (Page Middle Directory), and PTE (Page Table Entry). Each 9-bit index allows 512 entries per table, and, with 8-byte entries, each table consumes 4 KB.
For example, the address 0x00007f3144816042 (a typical location for shared library code) is encoded in binary as 0111 1111 0011 0001 0100 0100 1000 0001 0110 0000 0100 0010; thus,
- The L4 (PGD) index bits are 011111110, or 0x0FE;
- The L3 (PUD) index bits are 011000101, or 0x0C5;
- The L2 (PMD) index bits are 000100100, or 0x024;
- The L1 (PTE) index bits are 000010110, or 0x016;
- The page offset bits are 000001000010, or 0x042.
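As before, the indices fall out of shifts and masks; this Python sketch recomputes them directly from the address:

```python
addr = 0x00007f3144816042          # the example shared-library address

offset = addr & 0xFFF              # low 12 bits: offset within the page
pte    = (addr >> 12) & 0x1FF      # L1 (PTE) index: 9 bits
pmd    = (addr >> 21) & 0x1FF      # L2 (PMD) index: 9 bits
pud    = (addr >> 30) & 0x1FF      # L3 (PUD) index: 9 bits
pgd    = (addr >> 39) & 0x1FF      # L4 (PGD) index: 9 bits

print([hex(x) for x in (pgd, pud, pmd, pte, offset)])
# -> ['0xfe', '0xc5', '0x24', '0x16', '0x42']
```

A TLB miss costs four memory accesses here (one per level) before the data access itself, which is why the TLB matters even more on 64-bit systems.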
Note that the high 16 bits (not shown) must be either all 0s or all 1s, matching bit 47, forming the “canonical form” required by x86_64. So we have a 256 TB virtual address space, split evenly between user and kernel space.
Meh. Only 256 terabytes of address space? Weak.
Actually, x86_64 has extended its virtual address space beyond 48 bits with Page Map Level 5 (PML5) tables, adding another 9-bit index to reach 57 bits.
Why did it drop from 10 bits per level to 9 bits?
On the 32-bit chip, a pointer is 4 bytes, so we can fit 1024 pointers in a page. On the 64-bit chip, a pointer is 8 bytes, so we can fit 512 = 2⁹ pointers in a page. So it makes sense to have 9-bit indices.
Aside: ARM64 Page Tables
ARM64 systems like Apple Silicon also use 48-bit virtual addressing by default (though the architecture supports 52-bit addresses with an optional feature), but handle it slightly differently from x86_64. Instead of four 9-bit fields, ARM64 uses a three-level page table by default, with a 16 KB page size splitting the address into
- 14 bits of page offset (for 16 KB pages)
- Three 11-bit chunks for the page table indices (together with the offset, that covers 47 of the 48 bits; mapping the full 48-bit space adds a tiny extra level at the top)
For example, the address 0x0000004208016000 would break down as 00000000100 00100000100 00000000101 10000000000000; thus,
- The L3 index bits are 00000000100, or 0x004;
- The L2 index bits are 00100000100, or 0x104;
- The L1 index bits are 00000000101, or 0x005;
- The page offset bits are 10000000000000, or 0x2000.
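Again the fields come from shifts and masks, now with a 14-bit offset and 11-bit indices; this Python sketch recomputes them from the address:

```python
addr = 0x0000004208016000          # the example ARM64 address

offset = addr & 0x3FFF             # low 14 bits (16 KB pages)
l1 = (addr >> 14) & 0x7FF          # L1 index: 11 bits
l2 = (addr >> 25) & 0x7FF          # L2 index: 11 bits
l3 = (addr >> 36) & 0x7FF          # L3 index: 11 bits

print([hex(x) for x in (l3, l2, l1, offset)])
# -> ['0x4', '0x104', '0x5', '0x2000']
```

Only the field widths change between architectures; the walk itself is the same index-and-descend process.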
Each table has 2048 entries (2¹¹), and with 8-byte entries, each table takes up 16 KB. ARM64 can be configured to use different page sizes, with the page table structure adjusting accordingly.
The ARM64 architecture also supports 4 KB pages with four-level page tables, similar to x86_64, or can use 64 KB pages with three-level page tables supporting 52-bit addresses.
Like x86_64, ARM64 requires addresses to be in canonical form, with the unused high bits matching bit 47 (or bit 51 in 52-bit mode).