CS 105

A Holistic Question

Suppose you have this code:

struct grade_entry {
    struct details* list;
    float score;
    char letter_grade;
};

and this function:

char get_letter_for(size_t index, struct grade_entry* entries) {
    return entries[index].letter_grade;
}

Assume that the code runs on a machine with this architecture:

  • 64-bit machine
  • x86-64 architecture
  • LP64 data model, using the Linux x86-64 ABI
  • Little-endian
  • A 64KB 8-way set-associative L1 data cache with 64-byte cache lines, using virtual addresses for indexing and tagging, and a write-back, write-allocate policy
  • A 64KB 8-way set-associative L1 instruction cache with 64-byte cache lines, using virtual addresses for indexing and tagging
  • A 1MB 16-way set-associative unified L2 cache with 128-byte cache lines, using physical addresses for indexing and tagging, and a write-back, write-allocate policy
  • 24-bit virtual addresses, with a page size of 4KB
    • 12 bits for the page offset
    • 12 bits for the virtual page number
    • total virtual address space: 16MB
  • 36-bit physical addresses, with a page size of 4KB
    • 12 bits for the page offset
    • 24 bits for the physical page number
    • total physical address space: 64GB
  • A 16-entry fully associative TLB

Your task is to trace the execution of the body of get_letter_for when called with:

index = 130

and with:

entries = 0x678000

You should trace execution up to but not including the ret instruction. In other words, you do not need to consider the stack access performed by ret.

Assume:

  • The code for get_letter_for resides on physical page 0xabcde.
  • The get_letter_for function starts at virtual address 0x51234.
  • The entries array starts at virtual address 0x678000.
  • The get_letter_for function has been called very recently, so its code is likely to be in the instruction cache and TLB.
  • The element at index 130 has never been accessed before.
  • None of its neighbors on the same page have ever been accessed either.
  • The virtual page containing entries[130] is an on-demand zero-filled page that is valid but not currently resident in physical memory.
  • Physical frame 0xf00d5 currently holds the least-recently-used resident page, so it will be evicted and reused for the new zero-filled page.
  • Assume the victim page in physical frame 0xf00d5 is clean, so it does not need to be written back before eviction.
  • You should describe the page fault and its architectural consequences, but you do not need to trace the kernel’s own instruction fetches, data accesses, page-table memory references, or cache behavior while handling the page fault.

Part A: Structure Layout

Determine the layout of struct grade_entry in memory.

Your answer should include:

  • the offset of each field
  • the size of each field
  • any padding bytes
  • the total size of the structure
  • the alignment requirement of the structure

Part B: Assembly Code

Write plausible optimized x86-64 assembly for the body of get_letter_for.

Assume the Linux x86-64 calling convention.

You may find these reminders useful:

  • When using a scale in an addressing mode, the only allowed values for scale are: 1, 2, 4, or 8.
  • shlq $k, %reg shifts a 64-bit register left by k bits. This is equivalent to multiplying an unsigned integer by 2^k.
  • movzbl source, %eax loads one byte from source, zero-extends it to 32 bits, and stores the result in %eax. The low byte %al is then the returned char.

Small bonus: Give an alternative version that uses an add instruction together with a complex addressing mode instead of a shift.


Part C: Address Calculation

For index = 130, determine the virtual address of:

entries[index].letter_grade

Your answer should show:

  • the size of each struct grade_entry
  • the byte offset of entries[130] within the array
  • the byte offset of letter_grade within the struct
  • the final virtual address being read

Part D: Instruction Fetch

The function starts at virtual address:

0x51234

and resides on physical page:

0xabcde

Analyze the instruction fetch for the body of the function.

Your answer should include:

  • the virtual page number and page offset for the instruction address
  • the corresponding physical address
  • the L1 instruction cache set index, block offset, and tag
  • whether this fetch is likely to hit or miss, given the assumptions above

Part E: Data Access, TLB, Page Fault, and Caches

Trace the load of:

entries[130].letter_grade

Your answer should include:

  1. The effective virtual address computed by the load instruction.

  2. The virtual page number and page offset for that address.

  3. The L1 data cache lookup:

    • block offset
    • set index
    • tag
    • whether the access hits or misses
  4. The TLB lookup:

    • whether it hits or misses
    • why
  5. The page fault:

    • why it occurs
    • what physical frame is selected
    • what happens to the old contents of that frame
    • what new virtual-to-physical mapping is established
  6. The physical address of the requested byte after the page fault has been handled.

  7. The L2 cache lookup for the requested physical address:

    • block offset
    • set index
    • tag
  8. The L1 data cache line that will contain the requested byte after the access completes.

  9. The value loaded into %eax.

  10. The value returned by the function, ignoring the later ret instruction.


Part F: Reflection

Briefly answer:

  1. Which architectural details in the prompt mattered for this particular access?

  2. Which details were mostly irrelevant or only mattered indirectly?

  3. Why does little-endianness not affect the value returned by this function?

  4. Why does the write-back/write-allocate policy not matter much for the user-level load itself?

Answers

We did this question in class. To check your work, the answers to all the parts are given below.

We are given:

struct grade_entry {
    struct details* list;
    float score;
    char letter_grade;
};

and:

char get_letter_for(size_t index, struct grade_entry* entries) {
    return entries[index].letter_grade;
}

We are tracing the function body for:

index = 130
entries = 0x678000

We trace up to, but not including, the final ret instruction.


Part A: Structure Layout

On x86-64 Linux using the LP64 data model:

Field                  Size     Alignment  Offset
struct details* list   8 bytes  8 bytes    0
float score            4 bytes  4 bytes    8
char letter_grade      1 byte   1 byte     12
(padding)              3 bytes             13–15

The pointer must be aligned to an 8-byte boundary, and because the structure contains an 8-byte-aligned field, the whole structure has alignment 8.

So:

offsetof(list)         = 0
offsetof(score)        = 8
offsetof(letter_grade) = 12
sizeof(struct grade_entry) = 16
alignof(struct grade_entry) = 8

The final 3 padding bytes ensure that in an array of struct grade_entry, each element begins at an address that is a multiple of 8.

Memory layout:

offset 0–7:    list pointer
offset 8–11:   score
offset 12:     letter_grade
offset 13–15:  padding

Part B: Assembly Code

Using the Linux x86-64 calling convention:

%rdi = first argument = index
%rsi = second argument = entries
%al  = low byte of return value

We want to compute:

entries[index].letter_grade

Each array element is 16 bytes, and letter_grade is at offset 12 within each element.

So the address is:

entries + index * 16 + 12

x86-64 addressing modes only allow scale factors of:

1, 2, 4, or 8

There is no scale factor 16. Therefore, one reasonable optimized implementation is:

shlq    $4, %rdi
movzbl  12(%rsi,%rdi), %eax

The first instruction computes:

%rdi = index * 16

The second instruction loads one byte from:

%rsi + %rdi + 12

and zero-extends it into %eax.

So it implements:

entries + index * 16 + 12

Bonus version

A version using add and a complex addressing mode is:

addq    %rdi, %rdi
movzbl  12(%rsi,%rdi,8), %eax

After the addq, %rdi contains:

2 * index

Then the addressing mode computes:

entries + (2 * index) * 8 + 12
= entries + index * 16 + 12

Same result, slightly more devious. Educationally nutritious, like broccoli with teeth.


Part C: Address Calculation

We are given:

entries = 0x678000
index = 130

Each struct grade_entry is 16 bytes.

So the offset of entries[130] in the array is:

130 * 16 = 2080 = 0x820

The letter_grade field is at offset 12:

12 = 0xc

So the total offset from the base of the array is:

0x820 + 0xc = 0x82c

Therefore the virtual address of entries[130].letter_grade is:

0x678000 + 0x82c = 0x67882c

So the function will attempt to load one byte from:

VA = 0x67882c

Part D: Instruction Fetch

But before we can read that memory, the CPU must fetch the instructions of the function itself, the very instructions that will cause that memory access.

The function starts at virtual address:

0x51234

The page size is 4KB:

4KB = 0x1000

So a virtual address splits into:

virtual page number: upper 12 bits
page offset:         lower 12 bits

For the instruction address:

VA = 0x51234

we get:

VPN    = 0x51
offset = 0x234

We are told that the function resides on physical page:

PPN = 0xabcde

Therefore the corresponding physical address is:

PA = 0xabcde000 + 0x234
   = 0xabcde234

But it turns out the physical address doesn't matter here, because this instruction fetch is likely to hit in the L1 instruction cache, which uses virtual addresses for indexing and tagging.

L1 instruction cache

The L1 instruction cache is:

64KB, 8-way set associative, 64-byte lines

Number of cache lines:

64KB / 64B = 1024 lines

Number of sets:

1024 / 8 = 128 sets

So:

block offset bits = log2(64)   = 6
set index bits    = log2(128)  = 7
tag bits          = 24 - 6 - 7 = 11

The cache uses virtual addresses for indexing and tagging.

For VA = 0x51234, the L1 instruction-cache fields are:

block offset = 0x34 (= 52 in decimal)
set index    = 0x48 (= 72 in decimal)
tag          = 0x28 (= 40 in decimal)

The relevant instruction-cache line begins at:

0x51200

and covers:

0x51200 through 0x5123f

Because the function has been called very recently, its code is likely already in the instruction cache and the relevant translation is likely already in the TLB. Therefore the instruction fetch is likely an L1 instruction-cache hit.

(Note: the two instructions together occupy fewer than 12 bytes, so they fit within the same cache line. Fetching each instruction still requires its own cache lookup.)


Part E: Data Access, TLB, Page Fault, and Caches

The body executes something like:

shlq    $4, %rdi
movzbl  12(%rsi,%rdi), %eax

Initially:

%rdi = 130 = 0x82
%rsi = 0x678000

After:

shlq $4, %rdi

we have:

%rdi = 0x82 << 4 = 0x820

Now the load computes:

12(%rsi,%rdi)
= %rsi + %rdi + 12
= 0x678000 + 0x820 + 0xc
= 0x67882c

So the effective virtual address of the load is:

VA = 0x67882c

Virtual page number and offset

With 4KB pages:

VPN    = 0x678
offset = 0x82c

So the byte is on virtual page:

VPN = 0x678

at page offset:

0x82c

L1 data cache lookup

The L1 data cache is:

64KB, 8-way set associative, 64-byte lines

As with the instruction cache:

number of sets = 128
block offset bits = 6
set index bits    = 7
tag bits          = 11

The L1 data cache uses virtual addresses for indexing and tagging.

For:

VA = 0x67882c

we get:

block offset = 0x2c  (= 44 in decimal)
set index    = 0x20  (= 32 in decimal)
tag          = 0x33c (= 828 in decimal)

The relevant L1 data-cache line begins at:

0x678800

and covers:

0x678800 through 0x67883f

This line contains four 16-byte grade_entry structs:

entries[128] starts at 0x678800
entries[129] starts at 0x678810
entries[130] starts at 0x678820
entries[131] starts at 0x678830

The requested byte is:

entries[130].letter_grade
= 0x678820 + 0xc
= 0x67882c

We are told that this element and its neighbors on the same page have not been accessed before, and that the page is not resident in physical memory. Therefore the L1 data-cache access cannot succeed.

So the L1 data-cache lookup misses. Specifically, it is a cold miss, because the cache line has never been loaded before.


TLB lookup

The load needs to translate:

VA = 0x67882c
VPN = 0x678

We are told that the virtual page containing entries[130] is a valid on-demand zero-filled page, but it is not currently resident in physical memory.

Therefore the TLB does not have a valid resident translation for this page.

So the TLB lookup misses.

The processor checks the page table and discovers that the page is valid but not resident. This causes a page fault, which is handled by the OS kernel.


Page fault

The page fault occurs because virtual page:

VPN = 0x678

is valid from the program's perspective, but it is not currently mapped to a resident physical frame, and so it is marked as not present in the page table from the hardware's perspective.

The OS handles the page fault. It needs to allocate a physical frame for the page.

We are told:

physical frame 0xf00d5

currently holds the least-recently-used resident page, so this frame is selected as the victim.

We are also told that the victim page is clean, so it does not need to be written back to memory or disk before eviction. (Phew!)

The OS evicts the old mapping for physical frame 0xf00d5, zero-fills that physical frame, and maps the faulting virtual page to it:

VPN 0x678 -> PPN 0xf00d5

This is done by updating the page table entry for VPN 0x678 to point to PPN 0xf00d5, marking it valid, and setting appropriate page permissions (read and write, but not execute).

After the page fault is handled, the instruction that caused the fault is restarted.

This is an important point: the load does not merely continue halfway through. Architecturally, the faulting instruction is retried after the OS has fixed the missing mapping.


Physical address after the page fault

The load again accesses:

VA = 0x67882c

Again the CPU checks the L1 data cache and finds that this virtual address misses. It checks the TLB and misses there too. But now when it walks the page table, it finds a valid resident mapping (which is also installed in the TLB):

VPN 0x678 -> PPN 0xf00d5

The page offset is still:

0x82c

Therefore the physical address is:

PA = 0xf00d5000 + 0x82c
   = 0xf00d582c

So the requested byte is at:

PA = 0xf00d582c

L2 cache lookup

The L2 cache is:

1MB, 16-way set associative, 128-byte lines

Number of cache lines:

1MB / 128B = 8192 lines

Number of sets:

8192 / 16 = 512 sets

So:

block offset bits = log2(128) = 7
set index bits    = log2(512) = 9
tag bits          = 36 - 7 - 9 = 20

The L2 uses physical addresses for indexing and tagging.

For:

PA = 0xf00d582c

we get:

block offset = 0x2c   (= 44 in decimal)
set index    = 0xb0   (= 176 in decimal)
tag          = 0xf00d (= 61453 in decimal)

The relevant L2 cache line begins at:

0xf00d5800

and covers:

0xf00d5800 through 0xf00d587f

Because this is a newly zero-filled page, the exact L2 behavior depends on how the OS performed the zero-fill and whether those cache effects are being modeled. Under the usual simplified trace, we say that after the page fault, the load fetches the needed cache line into the cache hierarchy.

The important result is that the byte at physical address:

0xf00d582c

is read from the newly zero-filled page.

So the byte value is:

0x00

L1 data-cache line after the access

The L1 data-cache line corresponding to the load is the 64-byte line:

VA range: 0x678800 through 0x67883f

which maps to physical addresses:

PA range: 0xf00d5800 through 0xf00d583f

That line is placed into the L1 data cache using the virtual-address fields:

set index = 0x20   (= 32 in decimal)
tag       = 0x33c  (= 828 in decimal)

The requested byte is at line offset:

0x2c

and its value is:

0x00

Value loaded and returned

The instruction:

movzbl 12(%rsi,%rdi), %eax

loads one byte and zero-extends it into %eax.

The byte loaded is:

0x00

Therefore:

%eax = 0x00000000
%al  = 0x00

The function returns the char value:

'\0'

This is not the character '0'; it is the NUL character with numeric value zero.


Part F: Reflection

1. Which architectural details mattered?

These details mattered directly:

  • The LP64 data model, because it tells us pointers are 8 bytes.
  • The Linux x86-64 ABI, because it tells us where arguments and return values go.
  • The alignment rules, because they determine the struct layout and padding.
  • The page size, because it determines the virtual page number and page offset.
  • The virtual and physical address sizes, because they determine how addresses are split.
  • The TLB, because the data page translation is missing.
  • The L1 data-cache line size and associativity, because we compute the line offset, set index, and tag.
  • The L1 instruction-cache line size and associativity, because we analyze the instruction fetch.
  • The L2 line size and associativity, because we compute the physical cache line, set index, and tag.
  • The demand-zero page behavior, because it determines that the loaded byte is zero.
  • The replacement choice of physical frame 0xf00d5, because it determines the eventual physical address.

2. Which details were mostly irrelevant or indirect?

Some details were mostly irrelevant for this particular access:

  • Little-endianness does not affect a single-byte char load.
  • Write-back/write-allocate does not matter much for the user-level load itself.
  • The TLB size does not matter much unless we are also asked which old TLB entry gets evicted.
  • The fact that the L1 data cache is write-back/write-allocate matters more for stores than for this load.
  • The instruction cache details matter only for fetching the function body, not for computing the returned value.
  • The ret instruction would involve a stack read, but we explicitly stop before tracing it.

3. Why does little-endianness not affect the value returned?

Endianness determines how multi-byte values are laid out in memory.

For example, it matters when reading:

float score;
struct details* list;

because those fields occupy multiple bytes.

But letter_grade is a single byte. A single byte has no byte order. There is nothing to reverse.

So little-endianness does not affect this load.


4. Why does write-back/write-allocate not matter much for the user-level load?

The function performs a load, not a store.

Write-back/write-allocate policies mainly describe what happens when data is written:

  • write-back: modified cache lines are written to lower memory only when evicted
  • write-allocate: on a store miss, the cache first loads the missing line, then writes to it

The user-level instruction here is:

movzbl ...

which reads one byte.

The write-back/write-allocate policy may matter indirectly during page eviction or zero-filling, but it does not significantly affect the semantics of the user-level load itself.


Final Trace Summary

The short version of the whole execution is:

sizeof(struct grade_entry) = 16
offsetof(letter_grade) = 12

index = 130
array offset = 130 * 16 = 0x820
field offset = 0xc
target VA = 0x678000 + 0x820 + 0xc = 0x67882c

target VPN = 0x678
target page offset = 0x82c

The data page is valid but not resident.
The TLB lookup misses.
The page table indicates a demand-zero page.
A page fault occurs.

The OS evicts clean physical frame 0xf00d5.
It zero-fills that frame.
It maps VPN 0x678 -> PPN 0xf00d5.

The load restarts.

VA 0x67882c -> PA 0xf00d582c

The relevant byte comes from a zero-filled page, so it is 0x00.

movzbl loads 0x00 into %eax:

%eax = 0x00000000
%al  = 0x00

The function returns '\0'.

A plausible assembly body is:

shlq    $4, %rdi
movzbl  12(%rsi,%rdi), %eax

Bonus version:

addq    %rdi, %rdi
movzbl  12(%rsi,%rdi,8), %eax

The key result is:

return value = 0

because the accessed field lies on a newly allocated demand-zero page.
