Stranger Paging Options
It does seem like there's going to be a lot of overhead for paging. Each process needs its own page table, so we have a complete table for each process. That's a lot of memory overhead, right?
In our 32-bit system example, each 4 KB page of memory needs a 4-byte page-table entry, so that's 0.1% overhead, which isn't much.
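That figure is easy to sanity-check with quick arithmetic (a sketch using the chapter's running numbers of 4 KB pages and 4-byte entries):

```python
# Page-table overhead for a 32-bit address space with 4 KB pages
# and 4-byte page-table entries (the chapter's running example).
ADDRESS_SPACE = 2**32      # 4 GB
PAGE_SIZE = 4096           # 4 KB
ENTRY_SIZE = 4             # bytes per page-table entry

num_pages = ADDRESS_SPACE // PAGE_SIZE      # 1,048,576 pages
table_bytes = num_pages * ENTRY_SIZE        # 4 MB of entries

overhead = table_bytes / ADDRESS_SPACE      # 4/4096, i.e., about 0.1%
print(f"{table_bytes} bytes of page table, {overhead:.4%} overhead")
```

So a fully populated table costs 4 MB per process, which is 1/1024 (about 0.1%) of the 4 GB it maps.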
I don't think multi-level page tables are that smart a way to do it though. In CS 70 we learned about data structures like hash tables and binary search trees. Why not use one of those?
Remember, the data structure has to be directly supported by the hardware, so it can't be that complicated. But people have thought about other options, and that's what we're going to talk about now.
Hashed Page Tables
Let's look at the 32-bit PowerPC CPU, which merged the ideas of segmentation and paging. Each logical address is divided into three parts:
- 4 bits for the segment number
- 16 bits for the page number
- 12 bits for the offset
The segment number indexes a 16-entry segment table, whose entries each provide 24 bits that become the top bits of a global logical address, so the global logical address is 24 + 16 + 12 = 52 bits long.
A single global hash table maps global page numbers to physical page numbers. The hash table is indexed by the global page number, and each entry contains the physical page number. This arrangement is shown in the diagram below (in the diagram, “high” is the 24 bits from the segment table, and “page” is the 16 bits of the original program's logical address).
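Putting the pieces together, the translation from a 32-bit logical address to a 52-bit global logical address can be sketched as follows (a toy model: the segment-table contents are made up for illustration, and `high` and `page` match the labels in the diagram):

```python
SEGMENT_BITS, PAGE_BITS, OFFSET_BITS = 4, 16, 12

def split_logical(addr):
    """Split a 32-bit logical address into (segment, page, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    page = (addr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = addr >> (OFFSET_BITS + PAGE_BITS)
    return segment, page, offset

def global_address(addr, segment_table):
    """Form the 52-bit global logical address: 24-bit high | 16-bit page | 12-bit offset."""
    segment, page, offset = split_logical(addr)
    high = segment_table[segment]   # 24 bits from the segment table
    return (high << (PAGE_BITS + OFFSET_BITS)) | (page << OFFSET_BITS) | offset

# A made-up 16-entry segment table: each entry is a 24-bit value.
seg_table = [0xABC000 + i for i in range(16)]
```

For example, `global_address(0x12345678, seg_table)` uses segment 1, keeps page `0x2345` and offset `0x678`, and prepends the 24-bit entry `0xABC001`.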
I don't get why they used segments. Was it just to make the hash table idea work? And why do segments map to 24 bits of the logical address?
It is a little mind-bending; let's explain…
Insight: Fixed Partitioning Returns
Our global logical address space is 52 bits long, which is 4 petabytes. If we broke it up into 4 GB chunks, we could have a million of them, so we could give each of a million processes its own 4 GB chunk of (global logical) memory at once. Since we're never likely to have even close to a million processes in memory at once, we've got more than enough space in global memory to allocate each process a fixed chunk of global logical memory.
We then use paging to map global logical memory to physical memory. The hash table is used to map global page numbers to physical page numbers.
But we can do other tricks, too. Suppose we want to reserve 1 GB of space in each process for commonly shared libraries. This shared code can have a location in global logical memory, and each process will use some of its segment registers to map that shared code into its address space.
So it makes sharing easier. I like it!
Okay, I kinda see that, but can you break down how the hash table works?
The hardware's hash function allows only two buckets in the hash table to be examined (though, unlike our diagram, in reality each bucket contains several entries). The hash table size is chosen so that the odds of many pages colliding in the same bucket are low.
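The real PowerPC hashing scheme is more involved, but the two-probe lookup can be sketched like this (a toy model: the table size and hash functions are simplified stand-ins, though the PowerPC really does derive the second probe from the first):

```python
NUM_BUCKETS = 1024  # made-up size; real tables scale with physical memory

def primary_hash(gpn):
    return gpn % NUM_BUCKETS

def secondary_hash(gpn):
    # Simplified stand-in: on the PowerPC the second probe is derived
    # from the first (its complement), so no second hash computation is needed.
    return primary_hash(gpn) ^ (NUM_BUCKETS - 1)

def lookup(table, gpn):
    """Find the physical page for a global page number, probing at most
    two buckets; each bucket holds several (gpn, ppn) entries."""
    for h in (primary_hash(gpn), secondary_hash(gpn)):
        for entry_gpn, ppn in table[h]:
            if entry_gpn == gpn:
                return ppn
    return None  # not present: the CPU traps and the OS must step in

table = [[] for _ in range(NUM_BUCKETS)]
table[primary_hash(42)].append((42, 7))     # install a mapping
table[secondary_hash(99)].append((99, 3))   # overflow into the second bucket
```

Notice that `lookup` never touches more than two buckets, no matter how full the table is; that bounded search is what makes the scheme feasible in hardware.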
Low, but not zero. What if many pages do collide in the same bucket?
If they can't all fit, they can't fit. Some pages won't be in the table. When the CPU tries to access such a page, it will cause a trap, and the OS will need to adjust the hash table, throwing some entries out of that bucket so the needed page can go in. But this situation will be rare.
Whoa, that's crazy! So the OS has to be able to adjust the hash table on the fly?
It reveals an important truth, which is…
Insight: The CPU's Page Tables Don't Need to Reflect “Reality”
When the CPU tries to access a page that isn't in the page tables, it will generate a trap (known as a page fault). The OS is always free to make the page tables represent an incomplete view of reality.
For example, suppose a program asks to allocate 1 GB of (initially empty) memory. One option would be for the OS to actually allocate all those pages at once, but although the program says it needs 1 GB, perhaps it's only going to use 1 MB at a time. The OS can just claim it has allocated the memory, and then wait for the CPU to generate page faults as the program actually uses it. Every time the CPU generates a page fault, the OS can allocate another page of the promised memory.
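This lazy strategy can be sketched as a tiny simulation (the class and method names are made up for illustration; a real OS tracks promises in its virtual-memory data structures):

```python
class LazyAllocator:
    """Toy model: the OS promises pages, but only backs them on first fault."""
    def __init__(self):
        self.promised = set()   # pages the program thinks it owns
        self.backed = {}        # page -> physical frame, filled in on demand
        self.next_frame = 0

    def malloc_pages(self, start, count):
        # "Allocate" instantly: just record the promise, touch no frames.
        self.promised.update(range(start, start + count))

    def page_fault(self, page):
        # The CPU faulted: now back a promised page with a real frame.
        if page not in self.promised:
            raise MemoryError("access to unpromised page")  # a real bug
        frame = self.next_frame
        self.next_frame += 1
        self.backed[page] = frame
        return frame
```

Promising 1 GB (262,144 four-KB pages) costs nothing up front; physical frames are only consumed as faults arrive.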
Whoa! So the OS can lie to the CPU about what's allocated and what's not? And a program can think it has memory but it doesn't really exist in physical memory? And the program doesn't even know because when it actually accesses the memory, the OS will make it real? It's like we don't know what's real anymore!
In CS, when something might not truly be real, we call it virtual. When a program's memory might not all exist in the RAM of the computer, we call it virtual memory. And thus our logical address space is better called a virtual address space.
(Talking of terminology, for clarity we've used slightly variant terminology from the PowerPC reference manuals. If you want a side trip into the correct terminology, click the button below.)
PowerPC Terminology
Not that it particularly matters, but while we're talking about terminology, the PowerPC reference manuals use these terms:
- What we called the “segment table” it calls the segment register file, which contains 16 segment registers, which may either be accessed directly by their register name (e.g., sr3) or indirectly by their index.
- 32-bit addresses within a process are called effective addresses (this is a bit confusing, because on most other systems, we'd call these addresses virtual addresses or logical addresses).
- The 52-bit global logical address is called the interim virtual address, or frequently (and somewhat confusingly) just the virtual address.
The 32-bit PowerPC also had eight “BAT” (Block Address Translation) registers that could be used to map effective addresses to physical memory, bypassing the segment table and hash table. This could be used, for example, to map the entire kernel code as one big chunk of memory, or to map a device's memory into the address space of a process.
I'm not sure that's pedantry. It's important to know the right terms.
🌈 The More You Know! 🌟
If the OS is going to lie to the CPU about what's allocated so that the page tables have an incomplete picture, then the OS will need to have parallel data structures that know the real truth about what's allocated and what's not.
Yes, that's true. The OS has to keep track of what's really allocated, and what's just promised.
Meh, seems like a waste to do everything twice.
And that leads us to our final idea in the world of strange paging options: Get rid of the page tables entirely!
No Page Tables
Whoa! No page tables? How can that work?
Our final option is to have no page tables at all. In all our schemes, the CPU had a translation lookaside buffer (TLB) that cached the most recent translations. Because looking in the page tables is expensive, with multiple memory reads needed, the aim is for the TLB to have the right answer most of the time. But if it does have the right answer most of the time, why not just have only the TLB?
Thus,
- Most of the time, the TLB will map logical addresses to physical addresses.
- The TLB actually has many more entries than the four shown in the diagram, meaning that a “TLB hit” is very likely.
- Occasionally, we'll get a “TLB miss” and have to go to the OS to get the right answer. The OS will look up the answer, and then the CPU will cache it in the TLB.
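The whole translation path can be sketched as a loop around a software-managed TLB (a toy model: real TLB entries also carry validity and permission bits, and real refill handlers run as trap handlers, not function calls):

```python
class SoftwareTLB:
    """Toy software-managed TLB: hits are handled in 'hardware',
    misses trap to the OS, which refills an entry."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}  # virtual page number -> physical page number

    def translate(self, vpn, os_page_table):
        if vpn in self.entries:
            return self.entries[vpn]      # TLB hit: the common, fast case
        # TLB miss: trap to the OS, which looks up the real answer...
        ppn = os_page_table[vpn]
        # ...and the refill handler caches it in the TLB for next time.
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # evict (toy policy)
        self.entries[vpn] = ppn
        return ppn
```

The key property: correctness never depends on what's in the TLB, only speed does, so the OS is free to evict anything at any time.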
Won't a TLB miss be slow?
Yes. On the SPARC CPU, TLB misses were estimated to cost about 5–20% additional overhead. Later revisions of the SPARC architecture addressed this with one change: making the TLB much larger, with some of it on-chip and an overflow area in memory, off-chip. The CPU only asked the OS for help if it couldn't find the page in either the on-chip or off-chip TLB.
Perhaps most importantly for us, the TLB-only design is used by the MIPS architecture, and is thus what OS/161 uses.
Caching
One more design question is tangled up with address translation: when the CPU caches memory contents, should cache lookups use the virtual address or the physical address?
I guess both ways have advantages. Using virtual addresses is fast because we can skip the TLB lookup, but then we have to make sure the cache is consistent with the TLB. Using physical addresses avoids this problem, but then we have to wait for the TLB lookup.
I think I see a problem with virtual addresses. If two different virtual addresses map to the same physical address, we might have two different cache entries for the same physical address!
So do modern CPUs use the virtual address or the physical address for caching?
Modern CPUs actually use a combination of both methods to balance speed and correctness.
How Modern CPUs Handle Caching
Modern CPUs often employ a strategy called Virtually Indexed, Physically Tagged (VIPT) caching, especially for their L1 caches. This approach aims to combine the speed advantages of using virtual addresses with the correctness of physical addressing.
- Virtually Indexed: The cache is indexed using bits from the virtual address. This allows the cache access to start immediately, without waiting for address translation.
- Physically Tagged: The cache lines are tagged with physical addresses. After the virtual-to-physical translation is completed (usually via the TLB), the physical tag is compared to ensure the correct data is accessed.
Wait, so we begin accessing the cache before we've fully translated the address?
Exactly, Duck! By using the virtual address to index the cache, we can start the cache lookup earlier, reducing latency.
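The trick that makes this safe is worth seeing concretely: if the index bits fall entirely within the page offset, then the virtual and physical addresses have identical index bits, so indexing can start before translation finishes (a sketch; the cache geometry here is made up for illustration):

```python
OFFSET_BITS = 12   # 4 KB pages
LINE_BITS = 5      # 32-byte cache lines (made-up geometry)
INDEX_BITS = 7     # 128 sets; LINE_BITS + INDEX_BITS <= OFFSET_BITS

def cache_index(addr):
    """Which cache set an address falls in: bits above the line offset."""
    return (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)

# Because the index comes entirely from the page offset, and the page
# offset is identical in the virtual and physical address, the virtual
# index IS the physical index. These two addresses share an offset of
# 0xABC but live in different pages:
vaddr = 0x0040_1ABC  # hypothetical virtual address
paddr = 0x7777_1ABC  # hypothetical physical address, same page offset
```

So the cache can select the set from `vaddr` while the TLB is still producing `paddr`; the physical tag comparison happens only at the end.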
But if we're wrong, we'll have to throw out the data we fetched, right?
Meh. Sounds complicated.
It does add complexity, but it speeds up memory access significantly.
Avoiding Aliasing Problems
One major concern with using virtual addresses at all is aliasing, where different virtual addresses map to the same physical address. One option is to just outlaw aliasing, but that can be seen as overly restrictive.
We can make the aliasing problem much easier for the hardware people to solve if we ensure aliased addresses on shared pages always map to the same cache set.
Hay! What's a cache set?
Caches divide memory addresses into two parts: the index (e.g., bits 11..0) and the tag (the remaining bits). The index determines which set of cache lines to search, and the tag is used to verify the correct data. A four-way set associative cache has four cache lines in each set, so it can store four different pieces of data that map to the same index.
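The index/tag split and the four-way set can be sketched like this (a toy model with made-up geometry and a simplistic eviction policy; real caches do all of this in parallel hardware):

```python
class SetAssociativeCache:
    """Toy 4-way set-associative cache: the index picks the set,
    the tag verifies which address a line actually holds."""
    INDEX_BITS = 12  # index = bits 11..0, as in the text's example

    def __init__(self):
        self.sets = {}  # index -> list of up to 4 (tag, data) lines

    def lookup(self, addr):
        index = addr & ((1 << self.INDEX_BITS) - 1)
        tag = addr >> self.INDEX_BITS
        for line_tag, data in self.sets.get(index, []):
            if line_tag == tag:
                return data   # hit: tag matched within the set
        return None           # miss

    def fill(self, addr, data):
        index = addr & ((1 << self.INDEX_BITS) - 1)
        tag = addr >> self.INDEX_BITS
        lines = self.sets.setdefault(index, [])
        if len(lines) >= 4:
            lines.pop(0)      # evict the oldest line (toy policy)
        lines.append((tag, data))
```

Two addresses that differ only above bit 11 land in the same set but carry different tags, so a four-way set can hold up to four of them at once.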
There are two ways to ensure shared pages map to the same cache set:
- Small cache: If the cache is small enough that the index bits don't extend beyond the page offset, aliasing is automatically avoided (because it's not possible to have the same shared data at different page offsets).
- Larger caches: We can achieve the same goal in software by ensuring that shared pages always have identical index bits, even if they have different virtual addresses. An approach called page coloring is used to ensure that shared pages map to the same cache set. We won't go into the details, but if you're interested, you can find a more detailed explanation on the ARM community website.
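The core of page coloring is just modular arithmetic on page numbers (a sketch with made-up geometry: a cache whose index extends 2 bits past the 12-bit page offset, giving 4 colors):

```python
OFFSET_BITS = 12        # 4 KB pages
CACHE_INDEX_BITS = 14   # made-up: index extends 2 bits beyond the offset

NUM_COLORS = 1 << (CACHE_INDEX_BITS - OFFSET_BITS)   # 4 colors here

def page_color(page_number):
    # A page's color is the part of the cache index that comes from
    # the page number rather than the page offset.
    return page_number % NUM_COLORS
```

Aliased mappings of a shared page are safe only if the virtual pages share a color: for instance, pages 8 and 12 both have color 0 and may alias the same physical page, while pages 8 and 9 have different colors and would land the same data in different cache sets.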
What About Larger Caches?
For higher-level caches such as L2 and L3, CPUs typically use Physically Indexed, Physically Tagged (PIPT) caches.
- Physically Indexed: The cache is indexed using the physical address.
- Physically Tagged: The cache lines are tagged with the physical address.
Being as fast as possible isn't as critical for these caches, so the slight delay caused by waiting for address translation isn't as much of a concern.
What about OS/161?
OS/161 uses the MIPS R3000 architecture, which predates VIPT caching. Instead, it uses Virtually Indexed, Virtually Tagged (VIVT) caches. Thus we have to flush the cache when we switch address spaces to ensure that the cache is consistent with the TLB. And to avoid aliasing problems, it's easiest to just outlaw aliasing.
Gah. Am I going to have to worry about all this when I write my OS/161 code?
Actually, the code we give you will take care of most of this stuff for you. But it's always good to know what's going on under the hood.
Meh. What happens under the hood stays under the hood.