A Holistic Question
Suppose you have this code:
struct grade_entry {
struct details* list;
float score;
char letter_grade;
};
and this function:
char get_letter_for(size_t index, struct grade_entry* entries) {
return entries[index].letter_grade;
}
Assume that the code runs on a machine with this architecture:
- 64-bit machine
- x86-64 architecture
- LP64 data model, using the Linux x86-64 ABI
- Little-endian
- A 64KB 8-way set-associative L1 data cache with 64-byte cache lines, using virtual addresses for indexing and tagging, and a write-back, write-allocate policy
- A 64KB 8-way set-associative L1 instruction cache with 64-byte cache lines, using virtual addresses for indexing and tagging
- A 1MB 16-way set-associative unified L2 cache with 128-byte cache lines, using physical addresses for indexing and tagging, and a write-back, write-allocate policy
- 24-bit virtual addresses, with a page size of 4KB
- 12 bits for the page offset
- 12 bits for the virtual page number
- total virtual address space: 16MB
- 36-bit physical addresses, with a page size of 4KB
- 12 bits for the page offset
- 24 bits for the physical page number
- total physical address space: 64GB
- A 16-entry fully associative TLB
Your task is to trace the execution of the body of get_letter_for when called with:
index = 130
and with:
entries = 0x678000
You should trace execution up to but not including the ret instruction. In other words, you do not need to consider the stack access performed by ret.
Assume:
- The code for
get_letter_forresides on physical page0xabcde. - The
get_letter_forfunction starts at virtual address0x51234. - The
entriesarray starts at virtual address0x678000. - The
get_letter_forfunction has been called very recently, so its code is likely to be in the instruction cache and TLB. - The element at index
130has never been accessed before. - None of its neighbors on the same page have ever been accessed either.
- The virtual page containing
entries[130]is an on-demand zero-filled page that is valid but not currently resident in physical memory. - Physical frame
0xf00d5currently holds the least-recently-used resident page, so it will be evicted and reused for the new zero-filled page. - Assume the victim page in physical frame
0xf00d5is clean, so it does not need to be written back before eviction. - You should describe the page fault and its architectural consequences, but you do not need to trace the kernel’s own instruction fetches, data accesses, page-table memory references, or cache behavior while handling the page fault.
Part A: Structure Layout
Determine the layout of struct grade_entry in memory.
Your answer should include:
- the offset of each field
- the size of each field
- any padding bytes
- the total size of the structure
- the alignment requirement of the structure
Part B: Assembly Code
Write plausible optimized x86-64 assembly for the body of get_letter_for.
Assume the Linux x86-64 calling convention.
You may find these reminders useful:
- When using a scale in an addressing mode, the only allowed values for `scale` are `1`, `2`, `4`, or `8`.
- `shlq $k, %reg` shifts a 64-bit register left by `k` bits. This is equivalent to multiplying an unsigned integer by 2^k.
- `movzbl source, %eax` loads one byte from `source`, zero-extends it to 32 bits, and stores the result in `%eax`. The low byte `%al` is then the returned `char`.
Small bonus: Give an alternative version that uses an add instruction together with a complex addressing mode instead of a shift.
Part C: Address Calculation
For index = 130, determine the virtual address of:
entries[index].letter_grade
Your answer should show:
- the size of each `struct grade_entry`
- the byte offset of `entries[130]` within the array
- the byte offset of `letter_grade` within the struct
- the final virtual address being read
Part D: Instruction Fetch
The function starts at virtual address:
0x51234
and resides on physical page:
0xabcde
Analyze the instruction fetch for the body of the function.
Your answer should include:
- the virtual page number and page offset for the instruction address
- the corresponding physical address
- the L1 instruction cache set index, block offset, and tag
- whether this fetch is likely to hit or miss, given the assumptions above
Part E: Data Access, TLB, Page Fault, and Caches
Trace the load of:
entries[130].letter_grade
Your answer should include:
- The effective virtual address computed by the load instruction.
- The virtual page number and page offset for that address.
- The L1 data cache lookup:
  - block offset
  - set index
  - tag
  - whether the access hits or misses
- The TLB lookup:
  - whether it hits or misses
  - why
- The page fault:
  - why it occurs
  - what physical frame is selected
  - what happens to the old contents of that frame
  - what new virtual-to-physical mapping is established
- The physical address of the requested byte after the page fault has been handled.
- The L2 cache lookup for the requested physical address:
  - block offset
  - set index
  - tag
- The L1 data cache line that will contain the requested byte after the access completes.
- The value loaded into `%eax`.
- The value returned by the function, ignoring the later `ret` instruction.
Part F: Reflection
Briefly answer:
- Which architectural details in the prompt mattered for this particular access?
- Which details were mostly irrelevant or only mattered indirectly?
- Why does little-endianness not affect the value returned by this function?
- Why does the write-back/write-allocate policy not matter much for the user-level load itself?
Answers
We did this question in class. To check your work, click the button below to show the answers to all the parts.
We are given:
struct grade_entry {
struct details* list;
float score;
char letter_grade;
};
and:
char get_letter_for(size_t index, struct grade_entry* entries) {
return entries[index].letter_grade;
}
We are tracing the function body for:
index = 130
entries = 0x678000
We trace up to, but not including, the final ret instruction.
Part A: Structure Layout
On x86-64 Linux using the LP64 data model:
| Field | Size | Alignment | Offset |
|---|---|---|---|
| `struct details* list` | 8 bytes | 8 bytes | 0 |
| `float score` | 4 bytes | 4 bytes | 8 |
| `char letter_grade` | 1 byte | 1 byte | 12 |
| padding | 3 bytes | — | 13–15 |
The pointer must be aligned to an 8-byte boundary, and because the structure contains an 8-byte-aligned field, the whole structure has alignment 8.
So:
offsetof(list) = 0
offsetof(score) = 8
offsetof(letter_grade) = 12
sizeof(struct grade_entry) = 16
alignof(struct grade_entry) = 8
The final 3 padding bytes ensure that in an array of struct grade_entry, each element begins at an address that is a multiple of 8.
Memory layout:
offset 0–7: list pointer
offset 8–11: score
offset 12: letter_grade
offset 13–15: padding
Part B: Assembly Code
Using the Linux x86-64 calling convention:
%rdi = first argument = index
%rsi = second argument = entries
%al = low byte of return value
We want to compute:
entries[index].letter_grade
Each array element is 16 bytes, and letter_grade is at offset 12 within each element.
So the address is:
entries + index * 16 + 12
x86-64 addressing modes only allow scale factors of:
1, 2, 4, or 8
There is no scale factor 16. Therefore, one reasonable optimized implementation is:
shlq $4, %rdi
movzbl 12(%rsi,%rdi), %eax
The first instruction computes:
%rdi = index * 16
The second instruction loads one byte from:
%rsi + %rdi + 12
and zero-extends it into %eax.
So it implements:
entries + index * 16 + 12
Bonus version
A version using add and a complex addressing mode is:
addq %rdi, %rdi
movzbl 12(%rsi,%rdi,8), %eax
After the addq, %rdi contains:
2 * index
Then the addressing mode computes:
entries + (2 * index) * 8 + 12
= entries + index * 16 + 12
Same result, slightly more devious. Educationally nutritious, like broccoli with teeth.
Part C: Address Calculation
We are given:
entries = 0x678000
index = 130
Each struct grade_entry is 16 bytes.
So the offset of entries[130] in the array is:
130 * 16 = 2080 = 0x820
The letter_grade field is at offset 12:
12 = 0xc
So the total offset from the base of the array is:
0x820 + 0xc = 0x82c
Therefore the virtual address of entries[130].letter_grade is:
0x678000 + 0x82c = 0x67882c
So the function will attempt to load one byte from:
VA = 0x67882c
Part D: Instruction Fetch
But before we can read that memory, we first need to fetch the function's own instructions, the ones that will actually cause that memory access.
The function starts at virtual address:
0x51234
The page size is 4KB:
4KB = 0x1000
So a virtual address splits into:
virtual page number: upper 12 bits
page offset: lower 12 bits
For the instruction address:
VA = 0x51234
we get:
VPN = 0x51
offset = 0x234
We are told that the function resides on physical page:
PPN = 0xabcde
Therefore the corresponding physical address is:
PA = 0xabcde000 + 0x234
= 0xabcde234
But it turns out the physical address doesn't matter here, because this instruction fetch is likely to hit in the L1 instruction cache, which uses virtual addresses for indexing and tagging.
L1 instruction cache
The L1 instruction cache is:
64KB, 8-way set associative, 64-byte lines
Number of cache lines:
64KB / 64B = 1024 lines
Number of sets:
1024 / 8 = 128 sets
So:
block offset bits = log2(64) = 6
set index bits = log2(128) = 7
tag bits = 24 - 6 - 7 = 11
The cache uses virtual addresses for indexing and tagging.
For VA = 0x51234, the L1 instruction-cache fields are:
block offset = 0x34 (= 52 in decimal)
set index = 0x48 (= 72 in decimal)
tag = 0x28 (= 40 in decimal)
The relevant instruction-cache line begins at:
0x51200
and covers:
0x51200 through 0x5123f
Because the function has been called very recently, its code is likely already in the instruction cache and the relevant translation is likely already in the TLB. Therefore the instruction fetch is likely an L1 instruction-cache hit.
(Note: the two instructions total fewer than 12 bytes, so they fit within the same cache line. We still have to do a cache lookup for each instruction fetch.)
Part E: Data Access, TLB, Page Fault, and Caches
The body executes something like:
shlq $4, %rdi
movzbl 12(%rsi,%rdi), %eax
Initially:
%rdi = 130 = 0x82
%rsi = 0x678000
After:
shlq $4, %rdi
we have:
%rdi = 0x82 << 4 = 0x820
Now the load computes:
12(%rsi,%rdi)
= %rsi + %rdi + 12
= 0x678000 + 0x820 + 0xc
= 0x67882c
So the effective virtual address of the load is:
VA = 0x67882c
Virtual page number and offset
With 4KB pages:
VPN = 0x678
offset = 0x82c
So the byte is on virtual page:
VPN = 0x678
at page offset:
0x82c
L1 data cache lookup
The L1 data cache is:
64KB, 8-way set associative, 64-byte lines
As with the instruction cache:
number of sets = 128
block offset bits = 6
set index bits = 7
tag bits = 11
The L1 data cache uses virtual addresses for indexing and tagging.
For:
VA = 0x67882c
we get:
block offset = 0x2c (= 44 in decimal)
set index = 0x20 (= 32 in decimal)
tag = 0x33c (= 828 in decimal)
The relevant L1 data-cache line begins at:
0x678800
and covers:
0x678800 through 0x67883f
This line contains four 16-byte grade_entry structs:
entries[128] starts at 0x678800
entries[129] starts at 0x678810
entries[130] starts at 0x678820
entries[131] starts at 0x678830
The requested byte is:
entries[130].letter_grade
= 0x678820 + 0xc
= 0x67882c
We are told that this element and its neighbors on the same page have not been accessed before, and that the page is not resident in physical memory. Therefore the L1 data-cache access cannot succeed.
So the L1 data-cache lookup misses, specifically it's a cold miss because the cache line has never been loaded before.
TLB lookup
The load needs to translate:
VA = 0x67882c
VPN = 0x678
We are told that the virtual page containing entries[130] is a valid on-demand zero-filled page, but it is not currently resident in physical memory.
Therefore the TLB does not have a valid resident translation for this page.
So the TLB lookup misses.
The processor walks the page table and finds the entry marked not present. This causes a page fault, which is handled by the OS kernel; the kernel's own records show that the page is valid but simply not resident.
Page fault
The page fault occurs because virtual page:
VPN = 0x678
is valid from the program's perspective, but it is not currently mapped to a resident physical frame, and so is marked as not present in the page table from the hardware's perspective.
The OS handles the page fault. It needs to allocate a physical frame for the page.
We are told:
physical frame 0xf00d5
currently holds the least-recently-used resident page, so this frame is selected as the victim.
We are also told that the victim page is clean, so it does not need to be written back to memory or disk before eviction. (Phew!)
The OS evicts the old page that occupied physical frame 0xf00d5, zero-fills that frame, and maps the faulting virtual page to it:
VPN 0x678 -> PPN 0xf00d5
This is done by updating the page table entry for VPN 0x678 to point to PPN 0xf00d5 and marking it as valid and setting appropriate page permissions (read & write, but not execute).
After the page fault is handled, the instruction that caused the fault is restarted.
This is an important point: the load does not merely continue halfway through. Architecturally, the faulting instruction is retried after the OS has fixed the missing mapping.
Physical address after the page fault
The load again accesses:
VA = 0x67882c
Again the CPU checks the L1 data cache, and this virtual address still misses. It checks the TLB and misses there too. But now when it checks the page table, it finds a valid resident mapping:
VPN 0x678 -> PPN 0xf00d5
The page offset is still:
0x82c
Therefore the physical address is:
PA = 0xf00d5000 + 0x82c
= 0xf00d582c
So the requested byte is at:
PA = 0xf00d582c
L2 cache lookup
The L2 cache is:
1MB, 16-way set associative, 128-byte lines
Number of cache lines:
1MB / 128B = 8192 lines
Number of sets:
8192 / 16 = 512 sets
So:
block offset bits = log2(128) = 7
set index bits = log2(512) = 9
tag bits = 36 - 7 - 9 = 20
The L2 uses physical addresses for indexing and tagging.
For:
PA = 0xf00d582c
we get:
block offset = 0x2c (= 44 in decimal)
set index = 0xb0 (= 176 in decimal)
tag = 0xf00d (= 61453 in decimal)
The relevant L2 cache line begins at:
0xf00d5800
and covers:
0xf00d5800 through 0xf00d587f
Because this is a newly zero-filled page, the exact L2 behavior depends on how the OS performed the zero-fill and whether those cache effects are being modeled. Under the usual simplified trace, we say that after the page fault, the load fetches the needed cache line into the cache hierarchy.
The important result is that the byte at physical address:
0xf00d582c
is read from the newly zero-filled page.
So the byte value is:
0x00
L1 data-cache line after the access
The L1 data-cache line corresponding to the load is the 64-byte line:
VA range: 0x678800 through 0x67883f
which maps to physical addresses:
PA range: 0xf00d5800 through 0xf00d583f
That line is placed into the L1 data cache using the virtual-address fields:
set index = 0x20 (= 32 in decimal)
tag = 0x33c (= 828 in decimal)
The requested byte is at line offset:
0x2c
and its value is:
0x00
Value loaded and returned
The instruction:
movzbl 12(%rsi,%rdi), %eax
loads one byte and zero-extends it into %eax.
The byte loaded is:
0x00
Therefore:
%eax = 0x00000000
%al = 0x00
The function returns the char value:
'\0'
This is not the character '0'; it is the NUL character with numeric value zero.
Part F: Reflection
1. Which architectural details mattered?
These details mattered directly:
- The LP64 data model, because it tells us pointers are 8 bytes.
- The Linux x86-64 ABI, because it tells us where arguments and return values go.
- The alignment rules, because they determine the struct layout and padding.
- The page size, because it determines the virtual page number and page offset.
- The virtual and physical address sizes, because they determine how addresses are split.
- The TLB, because the data page translation is missing.
- The L1 data-cache line size and associativity, because we compute the line offset, set index, and tag.
- The L1 instruction-cache line size and associativity, because we analyze the instruction fetch.
- The L2 line size and associativity, because we compute the physical cache line, set index, and tag.
- The demand-zero page behavior, because it determines that the loaded byte is zero.
- The replacement choice of physical frame `0xf00d5`, because it determines the eventual physical address.
2. Which details were mostly irrelevant or indirect?
Some details were mostly irrelevant for this particular access:
- Little-endianness does not affect a single-byte `char` load.
- Write-back/write-allocate does not matter much for the user-level load itself.
- The TLB size does not matter much unless we are also asked which old TLB entry gets evicted.
- The fact that the L1 data cache is write-back/write-allocate matters more for stores than for this load.
- The instruction cache details matter only for fetching the function body, not for computing the returned value.
- The `ret` instruction would involve a stack read, but we explicitly stop before tracing it.
3. Why does little-endianness not affect the value returned?
Endianness determines how multi-byte values are laid out in memory.
For example, it matters when reading:
float score;
struct details* list;
because those fields occupy multiple bytes.
But letter_grade is a single byte. A single byte has no byte order. There is nothing to reverse.
So little-endianness does not affect this load.
4. Why does write-back/write-allocate not matter much for the user-level load?
The function performs a load, not a store.
Write-back/write-allocate policies mainly describe what happens when data is written:
- write-back: modified cache lines are written to lower memory only when evicted
- write-allocate: on a store miss, the cache first loads the line into cache, then writes to it
The user-level instruction here is:
movzbl ...
which reads one byte.
The write-back/write-allocate policy may matter indirectly during page eviction or zero-filling, but it does not significantly affect the semantics of the user-level load itself.
Final Trace Summary
The short version of the whole execution is:
sizeof(struct grade_entry) = 16
offsetof(letter_grade) = 12
index = 130
array offset = 130 * 16 = 0x820
field offset = 0xc
target VA = 0x678000 + 0x820 + 0xc = 0x67882c
target VPN = 0x678
target page offset = 0x82c
The data page is valid but not resident.
The TLB lookup misses.
The page table indicates a demand-zero page.
A page fault occurs.
The OS evicts clean physical frame 0xf00d5.
It zero-fills that frame.
It maps VPN 0x678 -> PPN 0xf00d5.
The load restarts.
VA 0x67882c -> PA 0xf00d582c
The relevant byte comes from a zero-filled page, so it is 0x00.
movzbl loads 0x00 into %eax:
%eax = 0x00000000
%al = 0x00
The function returns '\0'.
A plausible assembly body is:
shlq $4, %rdi
movzbl 12(%rsi,%rdi), %eax
Bonus version:
addq %rdi, %rdi
movzbl 12(%rsi,%rdi,8), %eax
The key result is:
return value = 0
because the accessed field lies on a newly allocated demand-zero page.