Demand Paging
In last week's lesson on paging, we saw the memory map of Neovim running on a 32-bit Raspberry Pi Zero. At the top of the diagram (the bottom of memory) was the code for Neovim, all 2.4 MB of it…
I don't know if that's a lot or a little.
Well, the original Mac from the 1980s had 128 KB and could run a fancy GUI and word processor. Neovim is editing text in the terminal. So yeah, although 2.4 MB of RAM usage is no big deal for most people these days, we've got to wonder a bit about code bloat and what exactly all that code is doing.
Meh. I bet half of it is never even used.
And that's our point…
The Case for Demand Paging
Here's actual data from my loading Neovim on a Raspberry Pi Zero. The light green shows code pages that were never used at all. The darker pages are shaded on a gradient showing when they were found to be needed, with teal being the first pages needed and brown being the last pages loaded.
Whoa! Only 49 pages were needed to use Neovim? That's less than 200 KB!
Yes, Neovim has 638 pages in total, so only 7.7% of the code was used during this editing session.
Meh. If Prof. Melissa knew how to do more in Vim than toggle in and out of insert mode and save and exit, she might have needed more of the code!
Demand Paging Defined
In the previous lesson we mentioned zero-filled memory, where the operating system can lie to the program and claim it has allocated memory when it hasn't done so yet. The page table entry for the page is marked as not present and the CPU generates a page fault when the program tries to access the page. At that point, the operating system can allocate a page of memory and fill it with zeros, then update the page table entry to point to the new page.
We can use exactly the same trick to load code pages on demand. Code is only loaded when the program tries to run code or access data in that page. This is called demand paging.
I remember from last time that to make this happen, the operating system needs to make page tables for the CPU that have “false” information (at least compared to what the program thinks is happening). So that means the operating system needs additional data structures to keep track of what's really going on.
Right. The program tries to access the memory, the CPU generates a page fault saying the page isn't in memory, and the operating system has to figure out what to do about it. If it's a page it had promised should exist, it needs to allocate physical memory and load the page in, adjust the page table, and then let the program continue.
And that last part, continuing the program, is actually a bit trickier than you might think…
Architectural Corner Cases
Getting a CPU to support demand paging correctly is more of a challenge than it might appear. We'll look at a CISC feature of the Motorola 68000 CPU that caused a problem with demand paging.
Hay! What's CISC?
CISC vs RISC
The idea of adding “useful features” to CPU instructions is a CPU design philosophy called complex instruction set computing (CISC). If you can make common operations take fewer instructions, you can make programs smaller and faster.
The alternative to CISC is reduced instruction set computing (RISC), where the idea is to make the CPU as simple as possible by keeping instructions simple and regular. If you make the CPU simple, you can make it faster and more power efficient. Today, people might call x86 a CISC architecture and ARM a RISC architecture, but in practice the lines between the two philosophies have become quite blurry.
Post-Increment Addressing
One really common pattern in coding is filling an array,
which we might write in C as
for (int* ip = array; ip < array + 1000; ip++) {
*ip = 0;
}
and compile to Motorola 68000 assembly:
move.l #0, D0
lea array, A0
move.w #999, D1 ; 1000 iterations (dbra counts down to -1)
loop:
move.l D0, (A0) ; store zero in array
addq.l #4, A0 ; move to next element
dbra D1, loop
Given how common the “store and move to next” pattern is, the CPU designers decided it was worthwhile to have the move instruction have a special addressing mode that would automatically increment the address register—post-increment addressing. So the assembly code becomes
move.l #0, D0
lea array, A0
move.w #999, D1 ; 1000 iterations (dbra counts down to -1)
loop:
move.l D0, (A0)+ ; store zero in array and increment A0
dbra D1, loop
Pre-Decrement Addressing
It made sense to have the opposite addressing mode, too: one that would first decrement the address register and then store the value—pre-decrement addressing.
To fill the array backwards, we'd write
int *ip = array + 1000;
for (int i = 1000; i != 0; --i) {
*--ip = 0;
}
which would become
move.l #0, D0
lea array + 4000, A0
move.w #999, D1 ; 1000 iterations (dbra counts down to -1)
loop:
move.l D0, -(A0) ; decrement A0 and store zero in array
dbra D1, loop
Now, think about what happens if the store in move.l D0, -(A0) touches a page that isn't in memory. The CPU takes a page fault partway through the instruction, the operating system loads the page, and then the instruction is restarted from the beginning. Do you see the problem?
I see the issue. The program will store the value in the wrong location because the decrement happens twice. Once when the instruction is executed the first time, and then again when the instruction is restarted after the page fault.
That's right!
This issue really did occur with the original 68000 series processor. There were workarounds, such as not using pre-decrement addressing, or having the operating system carefully check the machine code to see if it was a pre-decrement instruction and adjust the address register accordingly. But the issue was fixed in the 68010 and later processors by undoing the decrement when the page fault occurred. (For more details, check out this post on the Retrocomputing Stack Exchange.)
Meh. This is all ancient history. Who cares about the 68000 series processors?
Similar issues crop up in contemporary processors too—even RISC ones. Today's ARM64 chips (like Apple's M series) have dedicated instructions for loads and stores (with all other instructions only using registers instead of the smorgasbord of options available on CISC CPUs), but these load and store instructions, such as ldr, str, ldp, and stp, support a range of even more flexible pre- and post-indexing addressing modes than the 68000 did.
Despite being a CISC chip, the x86 actually doesn't have pre-decrement or post-increment addressing modes. But it does have a lot of other features that can interact with page faults in interesting ways, including the rep movsb instruction for copying memory (an instruction with a built-in loop!); the push and pop stack instructions, which are exactly analogous to -(sp) and (sp)+; and "unaligned" writes that can straddle page boundaries. So there's plenty for the x86 designers to be careful about, too.
Admittedly, both architectures are designed so that what happens when a page fault occurs is well-defined and handled by the CPU so that the instruction can be restarted safely. But this example is just one of the subtle interactions between CPU and operating-system features that can have unexpected ramifications if not thought through carefully.
Also, pre-decrement and post-increment have had a long-lasting influence on programming culture…
Origins of p++ and --p
In the early 1970s, as they worked on rewriting the Unix operating system for the PDP-11, Brian Kernighan and Dennis Ritchie wanted to avoid writing the entire operating system in assembler. To cut a long story short, they invented C. But one of the things they wanted out of C was that the assembly code the compiler produced might be almost as good as the hand-written assembler they had used previously.
Thus they envisioned writing C code like
strcpy(dest, src)
char *dest, *src;
{
register char *d, *s;
d = dest;
s = src;
while (*d++ = *s++) {
/* do nothing */
}
}
and having the compiler translate it to
.globl _strcpy
.text
_strcpy:
jsr r5,csv
mov 4(r5),r4
mov 6(r5),r3
L4: movb (r3)+,(r4)+
jne L4
jmp cret
That's actual output from the C compiler from Version 7 Unix, run on a PDP-11 simulator (with a few extraneous lines removed and a few adjustments to spacing). Notice that the generated code uses the post-increment addressing mode of the PDP-11. The movb (r3)+,(r4)+ instruction reads a byte from the memory location pointed to by r3, stores it in the memory location pointed to by r4, and then increments both r3 and r4. If the byte it copied isn't zero, the loop continues. This loop is as efficient as any hand-written assembler code could be.
Likewise, we could write
int*
poptonull(stack)
int *stack;
{
register int* sp;
sp = stack;
while (*--sp) {
/* do nothing */
}
return sp;
}
which would produce
.globl _poptonu
.text
_poptonu:
jsr r5,csv
mov 4(r5),r4
L4: tst -(r4)
jne L4
mov r4,r0
jmp cret
Again, the loop is two instructions long and as efficient as any hand-written assembler code could be.
Symmetry vs Asymmetry
The PDP-11 did not have a pre-increment or post-decrement addressing mode, but Kernighan and Ritchie thought it would be good to allow these operations in C even if the PDP-11 didn't support them.
So you could also write
strcpy(dest, src)
char *dest, *src;
{
register char *d, *s;
d = dest-1;
s = src-1;
while (*++d = *++s) {
/* do nothing */
}
}
and the compiler would do the best it could,
producing
.globl _strcpy
.text
_strcpy:
jsr r5,csv
mov 4(r5),r4
dec r4
mov 6(r5),r3
dec r3
L4: inc r4
inc r3
movb (r3),(r4)
jne L4
jmp cret
This code works, and it's just a few more instructions than the previous version. But obviously Kernighan and Ritchie would never themselves write the version that would produce this code. So they'd never say *++p or *p--, they'd always write *p++ or *--p.
This habit carried over into loops, where they'd write i++ to increment a loop variable. In many ways, ++i would be more logical, as i++ means increment-but-return-the-old-value, which is not what you want in a loop, and ++i puts the verb at the beginning of the phrase, which is more natural in English. And there was no efficiency gain in using i++ over ++i in a loop. But given that they always used *p++ with pointers, it felt natural for them to write i++. In 1972.
And thus the habit of writing i++ in loops was born, and it's been with us ever since.
I don't know whether to laugh or cry. 50 years later, we're still writing i++ in loops because of a quirk in the PDP-11 architecture and the muscle memory of two programmers from 1972. Meh. Makes me want to jump in a lake.
I'll drive you.
I'll bring the snacks. (FWIW, when I learned to code, everyone used i++ and I thought nothing of it and did the same. I just go along to get along. Doesn't seem like anything to get worked up about.)
Efficiency of Demand Paging
Hay! Isn't demand paging going to be slow? If we wait to load the code until we need it, when we really do need it, we'll have to wait while the operating system loads the page from disk, right?
Demand paging may end up being faster because it has less data to load (196 KB vs 2.4 MB), and the time to load a single page from storage is only a few milliseconds. So while it could be slower in some situations, in the common case, it's probably faster. But this is a classic example of a trade-off: avoiding wasting time loading pages we don't need vs. the time to load a page when we do need it.
Pre-Paging
One way to mitigate the time to load all the pages needed by a program when it starts up is to remember what pages were needed the last time the program ran—pre-paging. The operating system can load these pages into memory before the program starts running, so that when the program does start running, the pages are already in memory.
The downside of pre-paging is that it adds more complexity. We need to have somewhere to keep track of what pages were needed, and exactly which pages are actually needed might vary from run to run. So pre-paging is often considered not worth the effort.
Meh. I can relate.
Paging Out
So if we can load pages on demand, what about pages that we only needed to start up the program and then don't need anymore? Can we get rid of them?
That's a good point!
One option to get rid of unneeded pages is to have the program explicitly tell the operating system it doesn't need them anymore. The posix_madvise system call provides a way to do that:
posix_madvise(addr, length, POSIX_MADV_DONTNEED) tells the operating system that the program doesn't need the pages in the range [addr, addr+length) anymore. posix_madvise(addr, length, POSIX_MADV_WILLNEED) tells the operating system that the program will need the pages in the range [addr, addr+length) soon (making demand paging more efficient).
Meh. I don't want the hassle of figuring out what pages my program doesn't need anymore. The operating system should just figure it out.
Indeed. But we'll have to figure out how…