CS 134

Demand Paging

In last week's lesson on paging, we saw the memory map of Neovim running on a 32-bit Raspberry Pi Zero. At the top of the diagram (the bottom of memory) was the code for Neovim, all 2.4 MB of it…

  • Hedgehog speaking

    I don't know if that's a lot or a little.

  • L-Floppy speaking

    Well, the original Mac from the 1980s had 128 KB and could run a fancy GUI and word processor. Neovim is editing text in the terminal. So yeah, although 2.4 MB of RAM usage is no big deal for most people these days, we've got to wonder a bit about code bloat and what exactly all that code is doing.

  • Goat speaking

    Meh. I bet half of it is never even used.

  • PinkRobot speaking

    And that's our point…

The Case for Demand Paging

Here's actual data from my loading Neovim on a Raspberry Pi Zero. The light green shows code pages that were never used at all. The darker colored pages are colored on a gradient showing when they were found to be needed, with teal being the first pages needed and brown being the last pages loaded (you can also hover over any page to see when it was needed).

[Interactive figure: Neovim Code Pages. The 49 pages loaded on demand span addresses 0x0 to 0x27d000, with load times ranging from 0.74 ms to 59.03 ms after startup; never-used pages are shown in light green.]
  • Horse speaking

    Whoa! Only 49 pages were needed to use Neovim? That's less than 200 KB!

  • PinkRobot speaking

    Yes, Neovim has 638 pages in total, so only 7.7% of the code was used during this editing session.

  • Goat speaking

    Meh. If Prof. Melissa knew how to do more in Vim than toggle in and out of insert mode and save and exit, she might have needed more of the code!

Demand Paging Defined

In the previous lesson we mentioned zero-filled memory, where the operating system can lie to the program and claim it has allocated memory when it hasn't done so yet. The page table entry for the page is marked as not present and the CPU generates a page fault when the program tries to access the page. At that point, the operating system can allocate a page of memory and fill it with zeros, then update the page table entry to point to the new page.

We can use exactly the same trick to load code pages on demand. Code is only loaded when the program tries to run code or access data in that page. This is called demand paging.

  • Cat speaking

    I remember from last time that to make this happen, the operating system needs to make page tables for the CPU that have “false” information (at least compared to what the program thinks is happening). So that means the operating system needs additional data structures to keep track of what's really going on.

  • Dog speaking

    Right. The program tries to access the memory, the CPU generates a page fault saying the page isn't in memory, and the operating system has to figure out what to do about it. If it's a page it had promised should exist, it needs to allocate physical memory and load the page in, adjust the page table, and then let the program continue.

  • PinkRobot speaking

    And that last part, continuing the program, is actually a bit trickier than you might think…

Architectural Corner Cases

Getting a CPU to support demand paging correctly is more of a challenge than it might first appear. We'll look at a CISC feature of the Motorola 68000 CPU that caused a problem with demand paging.

  • Horse speaking

    Hay! What's CISC?

CISC vs RISC

The idea of adding “useful features” to CPU instructions is a CPU design philosophy called complex instruction set computing (CISC). If you can make common operations take fewer instructions, you can make programs smaller and faster.

The alternative to CISC is reduced instruction set computing (RISC), where the idea is to make the CPU as simple as possible by keeping instructions simple and regular. If you make the CPU simple, you can make it faster and more power efficient. Today, people might call x86 a CISC architecture and ARM a RISC architecture, but the lines are actually quite blurry these days.

Post-Increment Addressing

One really common pattern in coding is filling an array, which we might write in C as

for (int* ip = array; ip < array + 1000; ip++) {
    *ip = 0;
}

and compile to Motorola 68000 assembly:

    move.l  #0, D0
    lea     array, A0
    move.w  #999, D1        ; dbra counts down to -1, so 999 gives 1000 iterations
loop:
    move.l  D0, (A0)        ; store zero in array
    addq.l  #4, A0          ; move to next element
    dbra    D1, loop

Given how common the “store and move to next” pattern is, the CPU designers decided it was worthwhile to have the move instruction have a special addressing mode that would automatically increment the address register—post-increment addressing. So the assembly code becomes

    move.l  #0, D0
    lea     array, A0
    move.w  #999, D1        ; dbra counts down to -1, so 999 gives 1000 iterations
loop:
    move.l  D0, (A0)+       ; store zero in array and increment A0
    dbra    D1, loop

Pre-Decrement Addressing

It made sense to have the opposite addressing mode, too: one that would first decrement the address register and then store the value—pre-decrement addressing.

To fill the array backwards, we'd write

int *ip = array + 1000;
for (int i = 1000; i != 0; --i) {
    *--ip = 0;
}

which would become

    move.l  #0, D0
    lea     array + 4000, A0
    move.w  #999, D1        ; dbra counts down to -1, so 999 gives 1000 iterations
loop:
    move.l  D0, -(A0)       ; decrement A0 and store zero in array
    dbra    D1, loop

Remember that when we have a page fault for demand paging, the program is stopped, the operating system is called on to materialize the page, and then the program is allowed to continue by restarting the instruction that caused the page fault.

What could happen if the program was in the middle of the loop above and a page fault occurred? [Assuming no one has noticed the issue and come up with a workaround.]

  • Duck speaking

    I see the issue. The program will store the value in the wrong location because the decrement happens twice. Once when the instruction is executed the first time, and then again when the instruction is restarted after the page fault.

  • PinkRobot speaking

    That's right!

This issue really did occur with the original 68000 series processor. There were workarounds, such as not using pre-decrement addressing, or having the operating system carefully check the machine code to see if it was a pre-decrement instruction and adjust the address register accordingly. But the issue was fixed in the 68010 and later processors by undoing the decrement when the page fault occurred. (For more details, check out this post on the Retrocomputing Stack Exchange.)

  • Goat speaking

    Meh. This is all ancient history. Who cares about the 68000 series processors?

  • L-Chippy speaking

    Similar issues crop up in contemporary processors too, even RISC ones. Today's ARM64 chips (like Apple's M series) funnel all memory access through dedicated load and store instructions, with every other instruction working only on registers rather than the smorgasbord of memory operands available on CISC CPUs. But those load and store instructions, such as ldr, str, ldp, and stp, support an even wider range of pre- and post-indexing addressing modes than the 68000 did.

  • Rabbit speaking

    Despite being a CISC chip, the x86 actually doesn't have pre-decrement or post-increment addressing modes. But it does have a lot of other features that can interact with page faults in interesting ways: the rep movsb instruction for copying memory (an instruction with a built-in loop!); the push and pop stack instructions, which are exactly analogous to -(sp) and (sp)+; and “unaligned” writes that can straddle page boundaries. So there's plenty for the x86 designers to be careful about, too.

  • L-Chippy speaking

    Admittedly, both architectures are designed so that what happens when a page fault occurs is well-defined and handled by the CPU so that the instruction can be restarted safely. But this example is just one of the subtle interactions between CPU and operating-system features that can have unexpected ramifications if not thought through carefully.

  • Rabbit speaking

    Also, pre-decrement and post-increment have had a long-lasting influence on programming culture…

Origins of p++ and --p

In the early 1970s, as they worked on rewriting the Unix operating system for the PDP-11, Ken Thompson and Dennis Ritchie wanted to avoid writing the entire operating system in assembler. To cut a long story short, Ritchie invented C, evolving it from Thompson's earlier language B. But one of the things they wanted out of C was that the assembly code the compiler produced should be almost as good as the hand-written assembler they had used previously.

Thus they envisioned writing C code like

strcpy(dest, src)
    char *dest, *src;
{
    register char *d, *s;
    d = dest;
    s = src;
    while (*d++ = *s++) {
        /* do nothing */
    }
}

and having the compiler translate it to

.globl  _strcpy
.text
_strcpy:
        jsr     r5,csv
        mov     4(r5),r4
        mov     6(r5),r3
L4:     movb    (r3)+,(r4)+
        jne     L4
        jmp     cret

That's actual output from the C compiler from Version 7 Unix, run on a PDP-11 simulator (with a few extraneous lines removed and a few adjustments to spacing). Notice that the generated code uses the post-increment addressing mode of the PDP-11. The movb (r3)+,(r4)+ instruction reads a byte from the memory location pointed to by r3, stores it in the memory location pointed to by r4, and then increments both r3 and r4. If the byte it copied isn't zero, the loop continues. This loop is as efficient as any hand-written assembler code could be.

Likewise, we could write

int*
poptonull(stack)
    int *stack;
{
    register int* sp;
    sp = stack;
    while (*--sp) {
        /* do nothing */
    }
    return sp;
}

which would produce

.globl  _poptonu
.text
_poptonu:
        jsr     r5,csv
        mov     4(r5),r4
L4:     tst     -(r4)
        jne     L4
        mov     r4,r0
        jmp     cret

Again, the loop is two instructions long and as efficient as any hand-written assembler code could be.

Symmetry vs Asymmetry

The PDP-11 did not have a pre-increment or post-decrement addressing mode, but Kernighan and Ritchie thought it would be good to allow these operations in C even if the PDP-11 didn't support them.

So you could also write

strcpy(dest, src)
char *dest, *src;
{
    register char *d, *s;
    d = dest-1;
    s = src-1;
    while (*++d = *++s) {
        /* do nothing */
    }
}

and the compiler would do the best it could, producing

.globl  _strcpy
.text
_strcpy:
        jsr     r5,csv
        mov     4(r5),r4
        dec     r4
        mov     6(r5),r3
        dec     r3
L4:     inc     r4
        inc     r3
        movb    (r3),(r4)
        jne     L4
        jmp     cret

This code works, and it's just a few more instructions than the previous version. But obviously Kernighan and Ritchie would never themselves write the version that produces this code. So they'd never write *++p or *p--; they'd always write *p++ or *--p.

This habit carried over into loops, where they'd write i++ to increment a loop variable. In many ways, ++i would be more logical, as i++ means increment-but-return-the-old-value, which is not what you want in a loop, and ++i puts the verb at the beginning of the phrase, which is more natural in English. And there was no efficiency gain in using i++ over ++i in a loop. But given that they always used *p++ with pointers, it felt natural for them to write i++. In 1972.

And thus the habit of writing i++ in loops was born, and it's been with us ever since.

  • Goat speaking

    I don't know whether to laugh or cry. 50 years later, we're still writing i++ in loops because of a quirk in the PDP-11 architecture and the muscle memory of two programmers from 1972. Meh. Makes me want to jump in a lake.

  • Duck speaking

    I'll drive you.

  • Hedgehog speaking

    I'll bring the snacks. (FWIW, when I learned to code, everyone used i++ and I thought nothing of it and did the same. I just go along to get along. Doesn't seem like anything to get worked up about.)

Efficiency of Demand Paging

  • Horse speaking

    Hay! Isn't demand paging going to be slow? If we wait to load the code until we need it, when we really do need it, we'll have to wait while the operating system loads the page from disk, right?

What do you think? Is it going to be slower than what we had before?

Demand paging may end up being faster because it has less data to load (196 KB vs 2.4 MB), and the time to load a single page from storage is only a few milliseconds. So while it could be slower in some situations, in the common case, it's probably faster. But this is a classic example of a trade-off: avoiding wasting time loading pages we don't need vs. the time to load a page when we do need it.

Pre-Paging

One way to mitigate the time to load all the pages needed by a program when it starts up is to remember what pages were needed the last time the program ran—pre-paging. The operating system can load these pages into memory before the program starts running, so that when the program does start running, the pages are already in memory.

The downside of pre-paging is that it adds more complexity. We need to have somewhere to keep track of what pages were needed, and exactly which pages are actually needed might vary from run to run. So pre-paging is often considered not worth the effort.

  • Goat speaking

    Meh. I can relate.

Paging Out

  • Cat speaking

    So if we can load pages on demand, what about pages that we only needed to start up the program and then don't need anymore? Can we get rid of them?

  • PinkRobot speaking

    That's a good point!

One option to get rid of unneeded pages is to have the program explicitly tell the operating system it doesn't need them anymore. The posix_madvise system call provides a way to do that:

  • posix_madvise(addr, length, POSIX_MADV_DONTNEED) tells the operating system that the program doesn't need the pages in the range [addr, addr+length) anymore.
  • posix_madvise(addr, length, POSIX_MADV_WILLNEED) tells the operating system that the program will need the pages in the range [addr, addr+length) soon (making demand paging more efficient).
  • Goat speaking

    Meh. I don't want the hassle of figuring out what pages my program doesn't need anymore. The operating system should just figure it out.

  • PinkRobot speaking

    Indeed. But we'll have to figure out how…
