Simple Filesystems
Let's begin by trying to come up with the simplest and easiest way to support our low-level filesystem interface.
I've an idea for how to do it. I only need one file. Just give me ALL the disk.
What about me…?
Fine. You can have a file as well. Half each.
Fixed Partitioning
The first scheme for putting files on a disk is about as simple as we can devise. Just split the disk into preallocated fixed-size chunks. For example, we could divide the disk into four equal parts, and each file gets one of those parts. This scheme is called fixed partitioning.
Thus, we'd implement our filesystem operations as follows:
- create(): Find a free partition and return its ID (marking it as used).
- delete(fileID): Mark the partition as free.
- extend(fileID, count): Increase the space marked as used for this file; if the file size would exceed the partition size, return an error.
- size(fileID): Return the size of the used space for this file (not the partition size).
- write(fileID, blockNum, count, data): Write data to the partition for this file starting at a block number and writing a certain number of blocks, using the start of the partition as block 0.
- read(fileID, blockNum, count): Analogous to write; returns the data read.
Hay! You talk about “marking” partitions as used or free. How do you do that?
Ah yes, we should talk about metadata for the filesystem!
Metadata
Metadata is data about data. In the case of a filesystem, it's information about the files on the disk. For fixed partitioning, we need to keep track of which partitions are in use and which are free, as well as how big each one is and how much of it is actually in use.
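To make the bookkeeping concrete, here is a minimal Python sketch of fixed partitioning. The disk size, partition count, and the in-memory used array are our own illustrative choices, not part of any real filesystem:

```python
# A minimal in-memory sketch of fixed partitioning: a 16-block disk
# split into four 4-block partitions. (All sizes are illustrative.)

DISK_BLOCKS = 16
NUM_PARTITIONS = 4
PARTITION_BLOCKS = DISK_BLOCKS // NUM_PARTITIONS  # 4 blocks each

# Metadata: used[i] is None if partition i is free, otherwise the
# number of blocks the file in it currently occupies.
used = [None] * NUM_PARTITIONS

def create():
    for i, u in enumerate(used):
        if u is None:
            used[i] = 0   # mark partition as in use, size 0
            return i      # the partition ID doubles as the file ID
    raise RuntimeError("no free partitions")

def delete(file_id):
    used[file_id] = None  # mark the partition as free

def extend(file_id, count):
    if used[file_id] + count > PARTITION_BLOCKS:
        raise RuntimeError("extension exceeds partition size")
    used[file_id] += count

def size(file_id):
    return used[file_id]  # blocks in use, not the partition size
```

Notice that the metadata is nothing more than one small array: each entry records whether a partition is free and, if not, how much of it is in use.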
I sort of get it, but can you show me what it looks like?
Okay, let's introduce our filesystem simulator.
Filesystem Simulator
Our filesystem simulator will show you how the filesystem works. You can create files, write to them, and read from them. You can also see how the filesystem keeps track of the files and their sizes.
Because use of a filesystem is a bit more complex than “read a sector, write a sector,” we'll provide our own simple scripting language, SimScript, to let you automate the process of creating files, writing to them, and erasing them. It provides the following operations akin to the ones we've defined:
- create(): Create a new file and return its file ID.
- delete(fileID): Delete a file.
- extend(fileID, count): Extend a file by a certain number of blocks.
- size(fileID): Return the size of a file in blocks.
- write(fileID, blockNum, count): Write data to a file starting at a block number and writing a certain number of blocks—notice that we don't bother with the data itself.
- read(fileID, blockNum, count): Analogous to write.
- push(fileID, count): Extend a file by count blocks and write the new blocks. It is exactly equivalent to let oldSize = size(fileID); extend(fileID, count); write(fileID, oldSize, count); it's just provided for convenience.
Whoa. What's this SimScript language? Did you invent it?
Yes, Prof. Melissa invented it just for this lesson (not to be confused with SIMSCRIPT, which is from 1962!). It's a simple language which is a bit like JavaScript. You should be able to figure it out as we go along.
I'd love it if you could tell me MORE about it now!
Meh. I'm not reading a manual. I'll just wing it.
SimScript
SimScript is a simple scripting language designed specifically for filesystem simulations. It combines familiar JavaScript-like syntax with filesystem-specific operations and parallel execution capabilities.
Variables, Expressions, and Statements
Variables can be declared using let and support basic arithmetic operations:
let x = create() // Store a file ID
let s = size(x) // Get file size
let n = (s * 2) + 1 // Arithmetic expressions
Like Python, you can just start using a variable without declaring it first. If so, it becomes a global variable and begins with the value 0, or whatever value it already had if you did a previous simulation run.
You can end statements with a semicolon, but a newline also works, just like in JavaScript.
Comments start with // and continue to the end of the line, just like in C, C++, and JavaScript. The /* ... */ syntax is not supported.
Operators and Functions
SimScript supports standard arithmetic operators (+, -, *, /, %) and comparison operators (==, !=, <, >, <=, >=) and logical operators (&&, ||, !) with the usual operator precedence found in C and JavaScript.
It includes file-system functions we listed above, and some other helpful functions:
- randint(min, max) — Generate a random integer between min and max (inclusive)
- int(x) — Convert a floating-point number to an integer
- min(...) and max(...) — Return the minimum or maximum of the arguments.
Control Flow
SimScript includes loops via a repeat statement and C-like if statements:
// Sequential repetition
repeat 5 {
    let f = create()
    if (f == 0) {
        // First file we've created
        push(f, 1)
    } else {
        push(f, 2)
    }
}
SimScript does not include traditional for and while loops, but you can achieve similar functionality with repeat (repeat guarantees a fixed number of iterations).
Advanced Control Flow
SimScript also supports parallel execution blocks and parallel repetition, which can be useful for simulating what happens in a real filesystem when multiple operations are happening at once.
// Parallel execution block
par {
    push(f1, 3) // Each statement is executed in parallel
    push(f2, 2)
}

// Parallel repetition
par repeat 4 { // Each iteration is executed in parallel
    let f = randint(0, maxFileID)
    push(f, 1)
}
Meh, this seems over complicated. Why do we need parallel execution?
Often for filesystems, things get interesting when multiple operations are happening at once; one example would be two different processes writing to their own files. We can simulate that with parallel execution.
Error Handling
Operations that would exceed filesystem limits (like extending a file beyond available space) will fail with error messages. The simulation will stop at the point of failure and display the error.
Best Practices
- Use variables to keep track of file IDs
- The push operation is preferred over manual extend/write when appending
- Parallel blocks can improve simulation realism, but use them carefully
- Don't write to global variables from parallel blocks—which write wins will be unpredictable!
- Add comments to explain complex operations
You can explore the simulator by just clicking the button below. There are things to try next in the text following the simulator.
Actual Simulator
Exploring Fixed Partitioning (and the Simulator)
Hay! I tried clicking a second time but the simulator gave me an error.
That's right. The code creates four files, and our disk only has four partitions—once we've created four files, we can't create four more.
Once you've run the simulation once, try the following:
- Comment out the first line (using //) so that it doesn't try to create any more files.
- Run the simulation again, and it will write to the files that already exist.
- Change the second repeat to par repeat and see what happens.
Whoa. It's doing all ten writes at once!
That behavior mirrors what might be going on in a real filesystem, where multiple processes are writing to different files at the same time.
And shows that the filesystem code still needs to work correctly when lots of things are happening at once!
I clicked the button to run the simulation multiple times and it errored out with “Aborted: Extension exceeds partition size.” But there are still some free spots on the disk!
This doesn't seem like a very good scheme. You've got a fixed number of files, which isn't great, but even worse, those files are limited to a fixed size! If you have a small file, you're wasting a lot of space. If you have a big file, it might not fit at all.
But maybe we can find something good to say about it, too? Like… Uh… Well, actually, when you think about the properties that we learned about disk media, particularly spinning disks, having the data be contiguous is actually a good thing. So that's something good.
How about we just go with dynamic partitioning? That way we can have files of different sizes and we can use the disk space more efficiently.
Dynamic Partitioning
Both fixed partitioning and dynamic partitioning are approaches we first looked at when we considered how to allocate memory to programs, so it makes sense to consider both approaches for filesystems as well, even though dynamic partitioning is rarely used for filesystems in practice.
When you click the button below, the “Dynamic Partitioning” scheme will be added to the filesystem scheme drop-down, the simulator will automatically switch to use that scheme, and you'll get a new script to demonstrate it (replacing any existing script you might have made and erasing any existing filesystem state).
Click the link to go back to the simulator and try running the new script. You can also try switching to a larger disk size and fill it up with files of varying sizes. When you've experimented a bit, you can click the link under the simulator to come back here.
I like it. MORE files!!
Hay, why does the code “Grow the file, block by block”? Why not just do push(f, size)?
You can change the code to do that if you like. It will behave similarly. But in many real programs, they don't know how big a file will be in advance, so they might grow it incrementally, and this code demonstrates that—several smaller writes rather than one big one.
I tried running our old code from fixed partitioning, but it kept saying “No room to extend file”. That doesn't make sense—I can see free space right there on the disk!
Whoa! I think I see what's happening… Mouse over the green blocks to see what's there—once you create a new file, it's stuck between other files and can't grow anymore!
Um… so does that mean only the last file we create can ever get bigger?
That's right! With dynamic partitioning, files must be contiguous, and new files get placed in the first available space. Which means they immediately block any existing files from growing.
So we've solved the problem of wasted space in fixed partitions, but created a new problem where files can't grow unless they're at the end of the disk. That seems… not great.
Meh. Can't we just move the files around to make space?
Technically, yes. We could compact the disk by moving files around to consolidate free space. But doing that is slow and complex—imagine moving gigabytes of data just to grow a file by a few bytes!
And you'd have to do it every time a file needs to grow! No wonder real filesystems don't use this approach.
Actually, the BBC Micro from the 1980s did use this scheme, complete with a *compact command that would slide files down to consolidate free space (but then only allow the last file to grow). It wasn't a very good filesystem.
Hay! I noticed that even code we wrote specifically for dynamic partitioning didn't work when you changed repeat to par repeat. I guess it's the same issue? Multiple writes happening at once is going to be a problem in this scheme.
A big problem with these simple approaches is their requirement of contiguous allocation. While it's nice to put all the blocks of a file next to each other, insisting on it is too limiting. In practice, we need to be able to put files wherever there's space, even if it's scattered around the disk.
But before we leave this technique, it's worth noting that we can support multiple files being written at once if we preallocate space for them. So, for example, code where create now takes a size parameter and allocates that much space for the file would work.
par repeat 10 {
    let size = randint(3, 20);
    let f = create(size);
    let i = 0;
    repeat size { // Write the file, block by block
        write(f, i, 1)
        i = i + 1;
    }
}
Seems pretty limiting to need to know the size of a file in advance. What if you don't know how big a file is going to be?
Exactly. We need to do better.
How about a linked list of blocks?
I love it! We'll make each block link to the next one, and we can put them wherever there's space. We'll need a way to find the start of the file, but that's easy.
Meh. Seems inefficient. You'll have to read all the blocks to find the end of the file.
Maybe we can solve that issue…
File Allocation Table
The File Allocation Table (FAT) is a simple and widely used technique for managing files on disk. It's used in many filesystems, including the venerable FAT16 and FAT32 filesystems typically used on floppy disks, hard drives, and USB sticks.
Rather than store the linkage between blocks in the blocks themselves, that information is stored in the FAT. Each entry in the table corresponds to a block on disk, and the value in the table tells you the location of the next block in the file. The last block in the file has a special value to indicate that it's the end of the file.
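To make the table concrete, here is a small Python sketch of FAT bookkeeping. The table size, the FREE and EOF sentinels, and the function names are all our own illustrative choices; real FAT variants use reserved numeric codes instead:

```python
# A sketch of FAT bookkeeping with an 8-entry table. Entry values:
# FREE marks an unused block, EOF ends a file's chain, and any other
# value is the index of the next block in the chain.

FREE, EOF = -1, -2
fat = [FREE] * 8   # one entry per disk block
start = {}         # file ID -> first block of its chain

def chain(file_id):
    """Return the list of blocks that make up a file, in order."""
    blocks, b = [], start[file_id]
    while b != EOF:
        blocks.append(b)
        b = fat[b]     # follow the link stored in the table
    return blocks

def append_block(file_id):
    """Grab any free block and link it onto the end of the chain."""
    b = fat.index(FREE)   # raises ValueError if the disk is full
    fat[b] = EOF          # the new block is the new end of the chain
    if file_id not in start:
        start[file_id] = b
    else:
        last = chain(file_id)[-1]
        fat[last] = b     # the old tail now points at the new block
    return b
```

The key point is that the disk blocks themselves hold only data; all of the linkage lives in the table, so extending a file touches the table plus exactly one new block.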
Click the button below to switch to the FAT filesystem scheme and set up a new script to demonstrate it. The provided script
- Creates five (additional) files
- Performs seven iterations of the following:
  - Adds twelve blocks total to randomly chosen files (in parallel)
  - Deletes a random file
  - Creates a new file to replace it
As usual, this example is meant to represent a plausible filesystem workload with multiple processes working on different files at the same time.
Don't make any of the other repeat blocks parallel, though, as that would introduce race conditions by violating the assumption that the available files are described by numbers in the range 0 to f.
I'm still just excited that it works. We can fill up the disk with files of different sizes and they can be anywhere on the disk, and we can extend files so long as there's free space somewhere!
But the blocks for a file are scattered all over the disk. That seems like it would make it slow to read a file, especially if it's on an old-school spinning disk.
Hay! I noticed that when we delete a file, when I mouse over the freed blocks, it tells me what file they used to be part of. That's cool! But it also kinda shows that just because we've deleted a file, that doesn't mean the data is really gone.
I have an idea for a fix! Make the blocks BIGGER!
Adding Clusters
We can't actually make the blocks bigger, as they're a fixed size on the disk, but we can group blocks together to achieve the same effect. We could call them “logical blocks” but that name is taken, so instead we'll call them “clusters”. Each cluster will contain a fixed number of blocks (e.g., eight), and the FAT will point to clusters rather than individual blocks.
The effect of this change is
- The chains of blocks become chains of clusters, and become shorter.
- Within each cluster, blocks are contiguous, so reading a file is faster.
- The FAT table is smaller, as it points to clusters rather than individual blocks.
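The arithmetic for finding a block inside a cluster chain is a single division. A sketch, assuming eight blocks per cluster (the constant and function name are ours):

```python
# With 8-block clusters, a file's logical block number splits into a
# position along the cluster chain plus an offset inside that cluster.

BLOCKS_PER_CLUSTER = 8

def locate(block_num):
    """Which link of the cluster chain, and which block within it."""
    return divmod(block_num, BLOCKS_PER_CLUSTER)
```

So block 19 of a file lives in the file's third cluster (chain index 2), at offset 3 within that cluster.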
You can see this in action by clicking the button below to switch to the FAT filesystem with clusters of eight blocks each. The provided script is the same as before, but with the cluster size set to eight.
I like the clusters. The chains are shorter, and reading a file is faster because the blocks are contiguous within a cluster!
But we're wasting space! If a file doesn't use all the blocks in a cluster, we can't use them for anything else. I could store MORE data with the previous scheme!
Hay! I noticed that the first two blocks of the disk are used for metadata, so we're not aligned with the flash blocks on an SSD. That's bad for performance and longevity!
That reminds me—I hope that when we erase blocks, we're using trim to tell the SSD that they're free. Otherwise, we're going to wear out the disk faster.
By gathering blocks into clusters, we now have the problem of internal fragmentation. If a file doesn't use all the blocks in a cluster, we can't use the remaining space for anything else.
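The wasted space is easy to quantify. A sketch, again assuming eight-block clusters:

```python
# Internal fragmentation: the unused blocks in a file's last cluster.
# (The cluster size is illustrative.)

BLOCKS_PER_CLUSTER = 8

def wasted_blocks(file_blocks):
    # How far short of a full cluster the file's size falls.
    return -file_blocks % BLOCKS_PER_CLUSTER
```

A 19-block file occupies three clusters (24 blocks), wasting 5; on average, each file wastes about half a cluster.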
Clusters vs. SSDs and Flash Blocks
For SSDs, the problem is even worse. As we saw in the previous lesson, SSDs have a minimum unit of erasure, called a flash block. If our filesystem isn't aligned with the flash blocks, we'll end up with clusters straddling them, which is bad for performance and longevity. We'll need to be careful about how we manage our clusters to avoid this problem. In this case, we should have started the first cluster at block 8, not block 2, to avoid this issue.
It's also important to use the trim command to tell the SSD that a block is no longer in use. This requirement reveals a subtle difference between what the filesystem should do when there is an SSD compared to a spinning disk:
- For an SSD, it's important to use trim to tell the SSD that a cluster is free. At that point, the data in the cluster is gone.
- For a spinning disk, we don't need to do anything special when we erase a file. The data is still there, but the blocks are marked as free. (Which means someone who obtains a disk with “deleted” files may still be able to recover the data. Be sure to run a secure formatting option before you throw away (or donate) a hard disk!)
Complexity and Reliability
In some ways, the FAT scheme is still quite simple, but it is nevertheless more complex than our previous schemes.
Even just writing a simulator of FAT for this webpage, Prof. Melissa had to be very mindful of race conditions when updating the filesystem state. It's easy to get things wrong!
Some good rules of thumb are
- Updates to the FAT need to be atomic—only one process can be adding or removing clusters from the chains at a time.
- Updates to the FAT must precede updates to the disk blocks. Otherwise two processes might think the same block is free and write to it at the same time.
I just realized, we're putting all our eggs in one basket with the FAT table. If it gets corrupted, we lose the whole filesystem!
The file-allocation table itself can be quite large, and it is the only source of metadata about the filesystem. If it's corrupted, the filesystem is lost and cannot easily be reconstructed.
For this reason, it's actually common to have multiple copies of the FAT, so that if one is corrupted, the others can be used to reconstruct the filesystem. This is a simple form of redundancy that can help improve reliability.
Summary of Issues with the FAT Approach
While the FAT filesystem technique does give us a workable filesystem, as we've seen, it has some significant drawbacks. Let's summarize them:
- Internal Fragmentation: If a file doesn't use all the blocks in a cluster, the remaining space is wasted.
- File Fragmentation: Files are no longer contiguous, and blocks are scattered around the disk, which can slow down reading and writing files, particularly on spinning disks.
- Serialization: The need to update the FAT table atomically can create performance bottlenecks if multiple processes are trying to write to the disk at the same time.
- Complexity & Reliability: The FAT table is a single point of failure, and managing it correctly is complex. If the FAT table is corrupted, the filesystem is lost.
- SSD Issues: Older FAT filesystems unaware of SSDs may fail to align clusters with flash blocks or fail to use trim, leading to poor performance and faster wear.
- Space Wastage: The FAT table can be quite large, and if we make duplicates for redundancy, we're wasting even more space.
On the other hand, the FAT filesystem is simple and widely used. For something like a USB stick, where we usually just write files one at a time and often don't bother to erase them, it's a good choice. It's also a good choice for filesystems that need to be read on multiple operating systems, as it's widely supported.
By the way, what's the difference between FAT16 and FAT32?
The number!
FAT16 uses 16 bits for each entry in the FAT table, which limits the number of clusters to 65,536. FAT32 uses 32-bit entries, although only 28 of those bits are used, which still allows over 268 million clusters. That's a lot more, but it also means the FAT table is a lot bigger. FAT was originally designed to support floppy disks that held a mere 1.44 MB, so the 65,536 limit was fine. But when hard drives started getting bigger, they needed FAT32.
With a suitably chosen cluster size, FAT32 can support enormous volumes that were unimaginable when the first version of FAT, FAT16, was created.
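We can sanity-check those limits with a little arithmetic. The 32 KiB cluster size below is just one common configuration, and the real limits are slightly lower because a few entry values are reserved; these are order-of-magnitude illustrations:

```python
# Back-of-the-envelope volume limits: cluster count times cluster size.

KiB = 1024

def max_volume(clusters, cluster_size_kib):
    return clusters * cluster_size_kib * KiB  # bytes

fat16 = max_volume(2**16, 32)  # 65,536 clusters of 32 KiB = 2 GiB
fat32 = max_volume(2**28, 32)  # FAT32 uses 28 of its 32 bits = 8 TiB
```

So with the same 32 KiB clusters, moving from 16-bit to (effectively 28-bit) entries takes the ceiling from gigabytes to terabytes.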
Other Schemes
We won't simulate the remaining three filesystem schemes we'll look at.
In the first two, what happens at the disk level isn't much different than what we've already seen—how they differ is in how they manage the metadata about what block belongs to what file.
Indexed Allocation
In the indexed allocation scheme, each file has an index block that contains pointers to all the blocks in the file. This approach allows files to be noncontiguous, and the index block can be updated atomically. This scheme is used in the Unix File System (UFS, UFS2; from BSD and thus in many proprietary UNIXes) and in Linux's ext2 and ext3 filesystems (ext4 kept the same inode design but switched to extent-based allocation).
We can also designate some space in the index block to hold information that the higher-level filesystem might need, such as the file's size in bytes rather than blocks, access permissions, reference counts, and so on. There is one primary index block (a.k.a. index node or inode) for each numeric file ID, so there is a one-to-one correspondence between file IDs and index block numbers, or inode numbers.
What if we need MORE blocks than can fit in the index block?
Then we can have a doubly indexed scheme, where the index block ends with a few pointers to other index blocks, which in turn contain pointers to data blocks. This can be extended to a triply indexed scheme, and so on. The Unix File System (UFS) uses a triply indexed scheme.
Here's a diagram of what an index node looks like in UFS:
In the diagram,
- Yellow areas point directly to data blocks, so we call them direct.
- Cyan areas point to yellow areas, which point to data blocks, so they're called a single indirect.
- Green areas point to cyan areas, which point to yellow areas, which point to data blocks, so they're called a double indirect.
- Pink areas point to green areas, which point to cyan areas, which point to yellow areas, which point to data blocks, so they're called a triple indirect.
I see. It's a bit like the hierarchies we saw in memory management, where we had a page table that pointed to page tables that pointed to pages.
Except it's more complicated, because in page tables all the leaves were at the same level, but here we have different levels of indirection. I guess so we don't waste all that indirection on small files?
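That level-selection logic can be sketched in a few lines. The 12 direct pointers and 256 pointers per indirect block below are illustrative numbers; the real counts depend on the filesystem and its block size:

```python
# Which level of indirection serves a given logical block number?
# (Pointer counts are made up for illustration.)

NDIRECT = 12      # direct pointers held in the inode itself
PER_BLOCK = 256   # block pointers that fit in one indirect block

def level(block_num):
    if block_num < NDIRECT:
        return "direct"
    block_num -= NDIRECT
    if block_num < PER_BLOCK:
        return "single indirect"
    block_num -= PER_BLOCK
    if block_num < PER_BLOCK**2:
        return "double indirect"
    block_num -= PER_BLOCK**2
    if block_num < PER_BLOCK**3:
        return "triple indirect"
    raise RuntimeError("beyond maximum file size")
```

With these numbers, a small file never pays for any indirection at all, while the triple-indirect region alone covers 256³ blocks.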
How do we find the index block for a file?
Good point. Let's cover that, too.
Finding the Index Block
Each index block corresponds to a numeric file ID, so one option is to do things analogously to a fixed-size array. We decide on the maximum possible number of files we'll have, and put all the index blocks in a row. Then the offset of the desired index block is just the file ID times the size of the index block, which is almost certainly just one logical disk block, so the calculation is trivial.
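The lookup is a single multiply and add. A sketch, with an assumed location for the table:

```python
# Finding a file's index block in a fixed-size inode array.
# (The table's start block and the one-block inode size are assumptions.)

INODE_TABLE_START = 64   # disk block where the inode array begins
BLOCKS_PER_INODE = 1     # each index block occupies one disk block

def inode_block(file_id):
    return INODE_TABLE_START + file_id * BLOCKS_PER_INODE
```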
Whoa! Now we have to know how many files we want ahead of time? That's almost as bad as fixed partitioning!
Meh. Worse is better; that's the Unix way.
I don't like it at all! If I just have a few HUGE files, I'm wasting a lot of space on all those index blocks instead of my data. And if I want MORE files than whoever set the defaults thought I'd need, I'm out of luck.
The key positive features of indexed allocation are
- Distributed Metadata: The metadata for each file is stored separately, rather than using one big table for the whole filesystem as FAT does.
- Completely Flexible Block Allocation: Files can be any size, and blocks can be anywhere on the disk. Although we can have clusters, we don't need them, as the multiple levels of indirection scale well.
- Supports “Sparse” Files: By leaving some direct or indirect pointers null, we can have holes in files where there is no data. This ability is handy for some applications, such as databases, where we might want to reserve space for a file but neither write to it all at once nor always write at the end. Sparse files are not a natural fit for FAT as it links together actual allocated clusters.
Meh. I think it's overkill. We're probably going to be laying out the blocks of a file contiguously anyway, so we don't need all that indirection. I don't want to code all that extra stuff.
Hay! I realized that we're going to need to keep track of free blocks. That's kind of a global thing, isn't it, since they belong to everyone? But inodes are all about individual files. So how are we going to do that?
Good points! They'll get resolved! Let's see if we can come up with something simpler that still meets our needs.
Extents
In the extents scheme, each file has a list of extents, where each extent is a range of contiguous blocks on the disk (as opposed to a single block). This approach allows files to be noncontiguous, but assumes they'll mostly be made from large contiguous chunks.
Implementation is very similar to indexed allocation: each file has an “inode,” but instead of pointing to individual blocks, it holds a table of extents, where each extent is just a start block and a length. And, like indexed allocation, we have room to store a fixed number of extents in the file's inode, but if necessary we can overflow into additional blocks. Unlike indexed allocation, we don't need a complex hierarchy of indirection, because the number of extents is likely to be small even when the total number of blocks in the file is large.
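Looking up a logical block in an extent list is a short walk down the list. A sketch, with hypothetical extent data:

```python
# Mapping a logical block number to a disk block via an extent list.
# Each extent is a (start_block, length) pair.

def lookup(extents, block_num):
    for start, length in extents:
        if block_num < length:
            return start + block_num  # falls inside this extent
        block_num -= length           # skip past this extent
    raise RuntimeError("block beyond end of file")

# A 10-block file stored as two chunks: blocks 100-103 and 200-205.
extents = [(100, 4), (200, 6)]
```

Compare this with the FAT scheme: there, finding block 9 of a file means following nine links; here, it means skipping at most a handful of extents.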
This scheme was used in the HFS+ filesystem that used to be used on Macs and iPhones.
Shouldn't they call it an “enode” then instead of an “inode”, since it's about extents?
Because people tend to forget where terms come from, pretty much any filesystem metadata block that represents a file is called an inode, even if we're not using indexed allocation. And in any case, it is still a kind of index.
Log-Structured Filesystems
In a log-structured filesystem, all changes to the filesystem are written to a log, which is periodically merged into the main filesystem. This approach can be more efficient for SSDs, as it reduces repeated writes to the same block. Closely related are copy-on-write (CoW) filesystems such as ZFS and OpenZFS, and Apple's newer APFS, which never overwrite live data in place and so keep the filesystem consistent at all times.
Meh. I don't want to learn about another— I mean, hey! I'd love to go research this one on my own. No need to cover it here! wink
Yeah, super interesting. I'll look it up later, I'm sure. No need to cover it here! wink
I want to know MORE!
shhhh!
Okay, well, we'll set that mostly aside. But we do need to talk about free-space management before we wrap up.