CS 134

Stranger Paging Options

  • Cat speaking

    It does seem like there's going to be a lot of overhead for paging. Each process needs its own page table, so we have a complete table for each process. That's a lot of memory overhead, right?

  • PinkRobot speaking

    In our 32-bit system example, each 4 KB page of memory needs a 4-byte page-table entry, so that's 0.1% overhead, which isn't much.

  • Duck speaking

    I don't think multi-level page tables are that smart a way to do it though. In CS 70 we learned about data structures like hash tables and binary search trees. Why not use one of those?

  • PinkRobot speaking

    Remember, the data structure has to be directly supported by the hardware, so it can't be that complicated. But people have thought about other options, and that's what we're going to talk about now.

Hashed Page Tables

Let's look at the 32-bit PowerPC CPU, which merged the ideas of segmentation and paging. Each logical address is divided into three parts:

  • 4 bits for the segment number
  • 16 bits for the page number
  • 12 bits for the offset

The segment number then maps to a 16-entry segment table, which provides 24 bits to represent the top bits of a global logical address, so the global logical address is 24 + 16 + 12 = 52 bits.
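To make the bit-slicing concrete, here's a small sketch (the function and names are made up for illustration; this is not PowerPC code) of how a 32-bit logical address and the segment table combine into the 52-bit global logical address:

```python
SEG_BITS, PAGE_BITS, OFF_BITS = 4, 16, 12  # 4 + 16 + 12 = 32-bit logical address

def global_address(logical_addr, segment_table):
    """Form the 52-bit global logical address from a 32-bit logical address.

    Illustrative sketch; real hardware does this split with wiring, not code.
    """
    offset = logical_addr & ((1 << OFF_BITS) - 1)
    page = (logical_addr >> OFF_BITS) & ((1 << PAGE_BITS) - 1)
    seg = (logical_addr >> (OFF_BITS + PAGE_BITS)) & ((1 << SEG_BITS) - 1)
    high = segment_table[seg]  # 24 bits from the 16-entry segment table
    # 24 high bits + 16-bit page + 12-bit offset = 52-bit global address
    return (high << (PAGE_BITS + OFF_BITS)) | (page << OFF_BITS) | offset
```

Notice that only the top 4 bits change meaning: the page and offset bits pass through untouched, while the segment table swaps 4 bits for 24.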

A single global hash table maps global page numbers to physical page numbers. The hash table is indexed by the global page number, and each entry contains the physical page number. This arrangement is shown in the diagram below (in the diagram, “high” is the 24 bits from the segment table, and “page” is the 16 bits of the original program's logical address).

[Diagram: the processor splits a 32-bit logical address into segment, page, and offset; the segment table maps the segment number to the 24 "high" bits, which join the page number and are hashed to index the hash table; the matching entry gives the frame number, which combines with the offset to form the physical address into physical memory (frames).]
Hashed Page Table
  • Duck speaking

    I don't get why they used segments. Was it just to make the hash table idea work? And why do segments map to 24 bits of the logical address?

  • PinkRobot speaking

    It is a little mind-bending; let's explain…

Insight: Fixed Partitioning Returns

Our global logical address space is 52 bits long, which is 4 petabytes. If we broke it up into 4 GB chunks, we could have 1 million 4 GB chunks. If we allocated each process a 4 GB chunk, we could have 1 million processes in (global logical) memory at once. Since we're never likely to have even close to a million processes in memory at once, we've got more than enough space in global memory to allocate each process a fixed chunk of global logical memory.

We then use paging to map global logical memory to physical memory. The hash table is used to map global page numbers to physical page numbers.

But we can do other tricks, too. Suppose we want to reserve 1 GB of space in each process for commonly shared libraries. This shared code can have a location in global logical memory, and each process will use some of its segment registers to map that shared code into its address space.

  • Pig speaking

    So it makes sharing easier. I like it!

  • Hedgehog speaking

    Okay, I kinda see that, but can you break down how the hash table works?

The hardware's hash function only allows two buckets in the hash table to be examined (but, unlike our diagram, in reality each bucket contains several entries). The hash table size is chosen such that the odds of many pages colliding in the same bucket are low.
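A sketch of that two-probe lookup (simplified and hypothetical: the bucket layout and the way the second probe is derived here are invented for illustration, and real hardware packs several entries per bucket in a fixed-size group):

```python
def lookup(hash_table, n_buckets, global_page):
    """Look up a physical frame, probing at most two buckets.

    hash_table is a list of buckets; each bucket is a small list of
    (global_page, frame) entries. If neither probed bucket contains the
    page, the hardware gives up and traps to the OS.
    """
    primary = global_page % n_buckets          # first bucket to probe
    secondary = (~global_page) % n_buckets     # second (and last) probe
    for bucket in (primary, secondary):
        for entry_page, frame in hash_table[bucket]:
            if entry_page == global_page:
                return frame                   # found the translation
    return None  # not in the table: trap to the OS
```

The key property is the fixed bound: no chains, no rehashing; at most two buckets are ever examined, so a lookup takes a bounded number of memory reads.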

  • Hedgehog speaking

    Low, but not zero. What if many pages do collide in the same bucket?

  • PinkRobot speaking

    If they can't all fit, they can't fit. Some pages won't be in the table. When the CPU tries to access such a page, it will cause a trap, and the OS will need to adjust the hash table, throwing some things out of that bucket so that the needed page can be there. But this situation will be rare.

  • Horse speaking

    Whoa, that's crazy! So the OS has to be able to adjust the hash table on the fly?

  • PinkRobot speaking

    It reveals an important truth, which is…

Insight: The CPU's Page Tables Don't Need to Reflect “Reality”

When the CPU tries to access a page that isn't in the page tables, it will generate a trap (known as a page fault). The OS is always free to make the page tables represent an incomplete view of reality.

For example, suppose a program asks to allocate 1 GB of (initially empty) memory. One option would be for the OS to actually allocate all those pages at once, but perhaps although the program says it needs 1 GB, it's only going to use 1 MB at a time. The OS can just claim it's allocated the memory, and then wait for the CPU to generate page faults as the program actually uses the memory. Every time the CPU generates a page fault, the OS can allocate another page of the promised memory.
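A sketch of the bookkeeping this implies (the class and method names are invented; real OS code looks quite different, but the split between "promised" and "real" is the point):

```python
class LazyAllocator:
    """OS-side bookkeeping for memory that is promised but not yet real.

    Page tables start empty; a page becomes real only when a page fault
    shows the program actually touched it.
    """
    PAGE_SIZE = 4096

    def __init__(self):
        self.promised = set()   # pages the OS claims are allocated
        self.page_table = {}    # pages actually backed by physical frames
        self.next_frame = 0

    def allocate(self, start_page, n_pages):
        # Just record the promise: no frames are handed out yet.
        self.promised.update(range(start_page, start_page + n_pages))

    def handle_page_fault(self, page):
        if page not in self.promised:
            # The program touched memory it was never given.
            raise MemoryError("segmentation fault")
        # Make the promise real: back the page with a frame now.
        self.page_table[page] = self.next_frame
        self.next_frame += 1
        return self.page_table[page]
```

A fault on a promised page quietly becomes an allocation; a fault on an unpromised page is a genuine error.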

  • Horse speaking

    Whoa! So the OS can lie to the CPU about what's allocated and what's not? And a program can think it has memory but it doesn't really exist in physical memory? And the program doesn't even know because when it actually accesses the memory, the OS will make it real? It's like we don't know what's real anymore!

  • PinkRobot speaking

    In CS, when something might not truly be real, we call it virtual. When a program's memory might not all exist in the RAM of the computer, we call it virtual memory. And thus our logical address space is better called a virtual address space.

(Talking of terminology, for clarity we've used slightly variant terminology from the PowerPC reference manuals. If you want a side trip into the correct terminology, click the button below.)

PowerPC Terminology

Not that it particularly matters, but while we're talking about terminology, the PowerPC reference manuals use these terms:

  • What we called the “segment table” it calls the segment register file, which contains 16 segment registers, which may either be accessed directly by their register name (e.g., sr3), or indirectly by their index.
  • 32-bit addresses within a process are called effective addresses (this is a bit confusing, because on most other systems, we'd call these addresses virtual addresses or logical addresses).
  • The 52-bit global logical address is called the interim virtual address, or frequently (and somewhat confusingly) just the virtual address.

The 32-bit PowerPC also had eight “BAT” (Block Address Translation) registers that could be used to map effective addresses to physical memory, bypassing the segment table and hash table. This could be used, for example, to map the entire kernel code as one big chunk of memory, or to map a device's memory into the address space of a process.

  • Cat speaking

    I'm not sure that's pedantry. It's important to know the right terms.

  • Dog speaking

    🌈 The More You Know! 🌟

  • Dog speaking

    If the OS is going to lie to the CPU about what's allocated so that the page tables have an incomplete picture, then the OS will need to have parallel data structures that know the real truth about what's allocated and what's not.

  • PinkRobot speaking

    Yes, that's true. The OS has to keep track of what's really allocated, and what's just promised.

  • Goat speaking

    Meh, seems like a waste to do everything twice.

  • PinkRobot speaking

    And that leads us to our final idea in the world of strange paging options: Get rid of the page tables entirely!

No Page Tables

  • Horse speaking

    Whoa! No page tables? How can that work?

Our final option is to have no page tables at all. In all our schemes, the CPU had a translation lookaside buffer (TLB) that cached the most recent translations. Because looking in the page tables is expensive, with multiple memory reads needed, the aim is for the TLB to have the right answer most of the time. But if it does have the right answer most of the time, why not just have only the TLB?

[Diagram: the processor splits the logical address into page and offset; the translation lookaside buffer maps page 4 directly to frame 14, and the frame number joins the offset 193 to form the physical address into physical memory (frames). No page table is consulted.]
TLB-Only Translation

Thus,

  • Most of the time, the TLB will map logical addresses to physical addresses.
    • The TLB actually has many more entries than the four shown in the diagram, meaning that a “TLB hit” is very likely.
  • Occasionally, we'll get a “TLB miss” and have to go to the OS to get the right answer. The OS will look up the answer, and then the CPU will cache it in the TLB.
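In code, the whole translation story becomes very short. A sketch (a dict stands in for both the hardware TLB and the OS's private tables; the real TLB is fixed-size hardware, so a refill would also evict an old entry):

```python
PAGE_SIZE = 4096

def translate(tlb, os_page_table, vaddr):
    """TLB-only translation: the TLB is the only hardware table.

    On a miss we 'trap' to the OS, which looks up the answer in its own
    data structures and caches it in the TLB for next time.
    """
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page not in tlb:                    # TLB miss: trap to the OS
        tlb[page] = os_page_table[page]    # OS refills the TLB entry
    return tlb[page] * PAGE_SIZE + offset  # TLB hit path
```

The first access to a page pays for the OS lookup; every access after that is a pure TLB hit.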
  • Cat speaking

    Won't a TLB miss be slow?

  • PinkRobot speaking

    Yes. On the SPARC CPU, TLB misses were estimated to add about 5–20% overhead. Later revisions of the SPARC architecture made one change to reduce that cost: a much larger TLB, with some of it on-chip and an overflow area in memory, off-chip. The CPU only asked the OS for help if it couldn't find the page in either the on-chip or off-chip TLB.

  • BlueRobot speaking

    Perhaps most importantly for us, the TLB-only design is used by the MIPS architecture, and is thus what OS/161 uses.

In all the schemes, not just the TLB-only scheme above, the operating system will sometimes need to flush some or all of the TLB. When might it need to do this?

Caching

As a final thing to think about as we wrap up virtual-to-physical memory translation, let's think about caching. The CPU caches memory contents and does so based on the memory address of the data. But should that address be the logical/virtual address or the physical address?

Can you see any pros and cons for either option?

  • Dog speaking

    I guess both ways have advantages. Using virtual addresses is fast as we can skip the TLB lookup, but then we have to make sure the cache is consistent with the TLB. Using physical addresses avoids this problem, but then we have to wait for the TLB lookup.

  • Duck speaking

    I think I see a problem with virtual addresses. If two different virtual addresses map to the same physical address, we might have two different cache entries for the same physical address!

  • Cat speaking

    So do modern CPUs use the virtual address or the physical address for caching?

  • PinkRobot speaking

    Modern CPUs actually use a combination of both methods to balance speed and correctness.

How Modern CPUs Handle Caching

Modern CPUs often employ a strategy called Virtually Indexed, Physically Tagged (VIPT) caching, especially for their L1 caches. This approach aims to combine the speed advantages of using virtual addresses with the correctness of physical addressing.

  • Virtually Indexed: The cache is indexed using bits from the virtual address. This allows the cache access to start immediately, without waiting for address translation.
  • Physically Tagged: The cache lines are tagged with physical addresses. After the virtual-to-physical translation is completed (usually via the TLB), the physical tag is compared to ensure the correct data is accessed.
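A sketch of the VIPT idea (the cache geometry and names are invented for illustration; real L1 layouts vary, and the "parallel" translation is sequential here only because this is Python, not hardware):

```python
LINE_BITS = 6    # 64-byte cache lines
INDEX_BITS = 6   # small L1: index bits lie entirely within the page offset

def vipt_lookup(cache, vaddr, translate):
    """Virtually indexed, physically tagged lookup.

    Because the index bits lie within the page offset, they are identical
    in the virtual and physical address, so indexing can begin before
    translation finishes.
    """
    index = (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    candidate = cache.get(index)           # starts without waiting for the TLB
    paddr = translate(vaddr)               # TLB lookup runs in parallel
    tag = paddr >> (LINE_BITS + INDEX_BITS)
    if candidate is not None and candidate[0] == tag:
        return candidate[1]                # hit: the physical tag matches
    return None                            # miss: must fetch from memory
```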
  • Duck speaking

    Wait, so we begin accessing the cache before we've fully translated the address?

  • PinkRobot speaking

    Exactly, Duck! By using the virtual address to index the cache, we can start the cache lookup earlier, reducing latency.

  • Hedgehog speaking

    But if we're wrong, we'll have to throw out the data we fetched, right?

  • Goat speaking

    Meh. Sounds complicated.

  • PinkRobot speaking

    It does add complexity, but it speeds up memory access significantly.

Avoiding Aliasing Problems

One major concern with using virtual addresses at all is aliasing, where different virtual addresses map to the same physical address. One option is to just outlaw aliasing, but that can be seen as overly restrictive.

We can make the aliasing problem much easier for the hardware people to solve if we ensure aliased addresses on shared pages always map to the same cache set.

  • Horse speaking

    Hay! What's a cache set?

  • PinkRobot speaking

    Caches divide memory addresses into two parts: the index (e.g., bits 11..0) and the tag (the remaining bits). The index determines which set of cache lines to search, and the tag is used to verify the correct data. A four-way set associative cache has four cache lines in each set, so it can store four different pieces of data that map to the same index.
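A sketch of that two-part split driving a four-way set-associative lookup (the data layout is invented for illustration, following the text's split of index = bits 11..0, tag = the rest):

```python
WAYS = 4  # four-way set associative: four lines per set

def cache_lookup(cache_sets, addr):
    """The index picks the set; the tag verifies which data we found."""
    index = addr & 0xFFF          # bits 11..0 choose which set to search
    tag = addr >> 12              # the remaining bits identify the data
    for way_tag, data in cache_sets.get(index, [])[:WAYS]:
        if way_tag == tag:
            return data           # hit: this way holds our address
    return None                   # miss in all four ways
```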

There are two ways to ensure shared pages map to the same cache set:

  • Small cache: If the cache is small enough that the index bits don't extend beyond the page offset, aliasing is automatically avoided (aliased addresses differ only in their page number, so their index bits are always identical).
  • Larger caches: We can achieve the same goal in software by ensuring that shared pages always have identical index bits, even if they have different virtual addresses, an approach called page coloring. We won't go into the details, but if you're interested, you can find a more detailed explanation on the ARM community website.
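The coloring constraint itself is tiny to state in code. A sketch (the cache geometry is invented for illustration: 4 KB pages, with the cache index extending 3 bits past the page offset):

```python
OFFSET_BITS = 12                        # 4 KB pages
INDEX_BITS = 15                         # index extends 3 bits past the offset
COLOR_BITS = INDEX_BITS - OFFSET_BITS   # the bits that can cause aliasing

def same_color(vaddr1, vaddr2):
    """Two virtual mappings of one shared page land in the same cache
    sets only if the index bits beyond the page offset (the page's
    "color") agree."""
    color = lambda va: (va >> OFFSET_BITS) & ((1 << COLOR_BITS) - 1)
    return color(vaddr1) == color(vaddr2)
```

Page coloring is just the OS promising to choose virtual addresses for shared pages so that this check always passes.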

What About Larger Caches?

For higher-level caches such as L2 and L3, CPUs typically use Physically Indexed, Physically Tagged (PIPT) caches.

  • Physically Indexed: The cache is indexed using the physical address.
  • Physically Tagged: The cache lines are tagged with the physical address.

Being as fast as possible isn't as critical for these caches, so the slight delay caused by waiting for address translation isn't as much of a concern.

What about OS/161?

OS/161 uses the MIPS r3000 architecture, which predates VIPT caching. Instead, it uses Virtually Indexed, Virtually Tagged (VIVT) caches. Thus we have to flush the cache when we switch address spaces to ensure that the cache is consistent with the TLB. And to avoid aliasing problems, it's easiest to just outlaw aliasing.

  • Hedgehog speaking

    Gah. Am I going to have to worry about all this when I write my OS/161 code?

  • PinkRobot speaking

    Actually, the code we give you will take care of most of this stuff for you. But it's always good to know what's going on under the hood.

  • Goat speaking

    Meh. What happens under the hood stays under the hood.
