Understanding SSDs

Wow, all that stuff we were learning last time about spinning disks was so steampunk. I can't believe we still use that technology today. I mean, it's so slow and literally clunky. I'm glad we're moving on to SSDs now.
Meh. I can't believe we had to learn that stuff. Finally, I can forget about tracks and sectors. And no more worrying about bad blocks and crazy LBA mappings to the physical disk. SSDs are so much simpler.
Actually, in many ways, SSDs aren't as different as you might think…
Oh no…

The Characteristics of SSDs

Solid-state drives (SSDs) are a type of nonvolatile storage device that uses flash memory to store data. Unlike traditional hard drives, SSDs have no moving parts, which makes them faster and much less likely to lose your data if you drop them. They're also more energy-efficient and produce less heat than traditional hard drives.

Performance

Let's compare the performance of SSDs and traditional hard drives at the top end of each technology:

Performance Metric	NVMe SSDs (PCIe 4.0)	Fastest HDDs (15,000 RPM)
Sequential Read	Up to ~7 GB/s	~275 MB/s
Sequential Write	Up to ~7 GB/s	~275 MB/s
Access Time	20-50 µs	5-10 ms

Wow! SSDs are like, orders of magnitude faster than hard drives! No moving parts makes a huge difference.
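To put numbers on "orders of magnitude", we can work out the ratios directly from the table. This is just a quick sketch using the table's approximate figures; the variable names are ours.

```python
# Figures taken from the table above; names are illustrative.
ssd_seq_bytes_per_s = 7e9        # ~7 GB/s
hdd_seq_bytes_per_s = 275e6      # ~275 MB/s

ssd_access_s = (20e-6, 50e-6)    # 20-50 microseconds
hdd_access_s = (5e-3, 10e-3)     # 5-10 milliseconds

seq_speedup = ssd_seq_bytes_per_s / hdd_seq_bytes_per_s
# Smallest gap: fastest HDD access vs. slowest SSD access.
access_speedup_min = hdd_access_s[0] / ssd_access_s[1]
# Largest gap: slowest HDD access vs. fastest SSD access.
access_speedup_max = hdd_access_s[1] / ssd_access_s[0]

print(f"sequential: ~{seq_speedup:.0f}x")
print(f"access time: ~{access_speedup_min:.0f}x to ~{access_speedup_max:.0f}x")
```

So sequential throughput differs by a bit more than one order of magnitude (roughly 25x), while access time differs by two to nearly three orders of magnitude (roughly 100x to 500x).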

Physical Structure

The flash memory in an SSD is organized into blocks, which are further divided into pages. Each page is the smallest unit of data that can be read or written to the flash memory. When data is written to an SSD, it is written to an entire page at a time. Some typical sizes are:

Page size: 4–8 KB
Block size: 64–256 pages (0.25–2 MB)
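The block sizes in parentheses follow from multiplying pages per block by page size. A small sketch of that arithmetic (the function name is ours, and we assume binary units, 1 KB = 1024 bytes):

```python
KB = 1024
MB = 1024 * KB

def block_size_bytes(pages_per_block, page_size_bytes):
    """Size of one flash block, given its page count and page size."""
    return pages_per_block * page_size_bytes

smallest = block_size_bytes(64, 4 * KB)    # fewest pages, smallest pages
largest = block_size_bytes(256, 8 * KB)    # most pages, largest pages
print(smallest / MB, largest / MB)  # 0.25 2.0
```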

Wow, they sure picked confusing terminology for this stuff. With disks, a block was just one sector, but now it's a whole bunch of “pages”.
So a “block” is actually analogous to a “track” on a spinning disk, and a “page” is like a “sector” (or what we might have called a “block”, like in the acronym “LBA”)?
I guess they kinda wanted to be all “We're memory, not disks”.

Let's dive into looking at the behavior of SSDs with a simulation. In the simulator below, we have a small SSD, and for now, we're just simulating the bare hardware of the flash memory. We'll assume reading is straightforward and analogous to reading RAM. We'll focus on writing and erasing…

SSD Simulator

[Interactive SSD simulator: drop-downs for SSD size and properties, a write/erase toggle, an LBA entry field, and a status line]

Explore!

The toggle under the visualization lets you switch modes between writing (also sometimes called “programming”) and erasing. The drop-downs let you change the size of the SSD, but for now leave it on the smallest size. As we learn more, we'll have different levels of sophistication, but for now you can only select “Physical Basics” from the drop-down for properties.

When you hover the mouse over a page, you can see various details about it and its enclosing block.
Click on some of the pages to write them.
- Once a page is written, we can't write to it again until it has been erased.
Switch to erase mode and try doing some erasing. The way erasing works may surprise you.
Focus on one block and do multiple writes and erases to it.
- To speed up the process, you can just hover the mouse over a cell and press “w” or “e” to write or erase it.

Whoa! I didn't realize that erasing a page would erase the whole block! I thought it would be like a disk where you can just erase a sector. And it's weird that you can't update the stuff you've written; you have to erase it all first.
And then, after a while, the block gets all worn out and you can't use it anymore! That's so different from disks, where you can write and rewrite as much as you want. It's like the SSD is a lot more fragile than a disk. You have to be careful not to wear it out.
Does it really wear out after only eight erase cycles? That seems like it would wear out really fast.
Actually, we made it wear out more quickly than a real flash block so you won't have to keep clicking all day.
It still feels like a strange way for things to work…
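The rules we've just discovered can be captured in a few lines of code. The following is only a toy model under our own assumptions, not how the simulator or any real drive is implemented: the class name, the four pages per block, and the method names are made up for illustration, and the eight-erase limit mirrors the simulator's deliberately low value.

```python
class FlashBlock:
    """Toy model of one flash block; not a real device interface."""

    def __init__(self, pages_per_block=4, max_erases=8):
        self.pages = [None] * pages_per_block   # None means erased
        self.erase_count = 0
        self.max_erases = max_erases            # simulator-style limit

    @property
    def worn_out(self):
        return self.erase_count >= self.max_erases

    def write(self, page, data):
        if self.worn_out:
            raise RuntimeError("block is worn out")
        if self.pages[page] is not None:
            raise RuntimeError("page must be erased before rewriting")
        self.pages[page] = data

    def erase(self):
        """Erasing always clears every page in the block."""
        if self.worn_out:
            raise RuntimeError("block is worn out")
        self.pages = [None] * len(self.pages)
        self.erase_count += 1


block = FlashBlock()
block.write(0, "a")
block.write(1, "b")
block.erase()   # both pages are gone, and the block is one cycle older
```

Note that there is no erase-one-page operation at all: the only way to make a written page writable again is to erase its whole block.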

An Analogy

One analogy for the physical behavior of flash memory is that it's just like using an old-school pencil and paper. When you write with a pencil, it makes an impression on the paper so it's no longer pristine. Similarly, it's harder to erase a single letter in the middle of a word, so you have to erase the whole word and rewrite it. And too many erasures and rewrites (or a coarse eraser) will wear out the paper, making it harder to write on and eventually impossible. (It might even make a hole in the paper!)

Meh. I don't know about these analogies. That's not really how it works.

Cell Kinds and Endurance

Flash memory cells are actually tiny transistors with an extra “floating gate” that can trap electrons (thanks to an insulating oxide layer that completely surrounds the gate). When electrons are trapped in this gate, they change how the transistor behaves, which we can exploit to store information.

Writing (or “programming”) a cell involves applying a voltage that pushes electrons into the floating gate through a process called “quantum tunneling”. Reading checks to see if the electrons are there by seeing how the transistor responds. Erasing requires a large voltage (around 20 V!) in the opposite direction to forcibly pull all the electrons back out.

That can't be good for the transistor! 20 volts is a lot!

This high-voltage erasure process gradually damages the oxide layer. Over time, the damage accumulates until the layer can no longer reliably trap electrons, at which point the cell becomes unreliable and eventually unusable.

The simplest flash cells are called Single-Level Cells (SLC), which store just one bit per cell—either electrons are trapped there (1) or they're not (0). Even if a cell has deteriorated a bit, we can still reliably distinguish between these two states. But to make SSDs bigger and cheaper, manufacturers developed ways to store more bits per cell by trapping different amounts of electrons:

SLC (1 bit): ~100,000 erase cycles, fastest and most reliable
MLC (2 bits): ~10,000 erase cycles
TLC (3 bits): ~3,000 erase cycles
QLC (4 bits): ~1,000 erase cycles, slowest but highest density

Each additional bit requires the controller to distinguish between more voltage levels (2 levels for SLC, 4 for MLC, 8 for TLC, 16 for QLC), which is why more bits per cell means lower endurance—it's harder to maintain these precise voltage levels over many cycles.
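A rough way to see why this gets harder is to count the levels and look at how closely they must be packed. The sketch below assumes the levels are evenly spaced across a normalized charge window, which is a simplification of ours; real devices space and sense their levels differently.

```python
def levels(bits_per_cell):
    """Number of distinct charge levels needed to store this many bits."""
    return 2 ** bits_per_cell

def relative_spacing(bits_per_cell):
    # Fraction of a normalized charge window between adjacent levels,
    # assuming evenly spaced levels (an illustrative simplification).
    return 1 / (levels(bits_per_cell) - 1)

for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    print(name, levels(bits), round(relative_spacing(bits), 3))
```

Under this simplified model, adjacent QLC levels sit 15 times closer together than the two SLC levels, so a small amount of drift that SLC shrugs off can push a QLC cell into a neighboring level.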

Four times as much data but 100 times less endurance? That seems like a bad trade-off.
Meh. How hard could it be! Sixteen values isn't that many.
Well, let's take a look.

Level-Reading Difficulty Challenge

The little game below helps demonstrate the difficulty of reading multiple bits per cell. The goal is to match the target level by selecting the correct one from the grid (where we use colors to represent different charge levels for the flash cells). The more levels you have to choose from, the harder it is to distinguish between them. You can choose how many levels to have and how much “wear” to apply to the cells to see how it affects your ability to distinguish between them.

[Interactive game: controls for the number of levels and the amount of wear, a target shade, and a grid of shades to choose from]

I think I'm going to avoid QLC drives. I can't even tell the difference between the shades!
Hay! I've had an even more disturbing thought: storing our data depends on these electrons staying trapped in these floating gates. What if they leak out over time? Will our data just disappear?
You're right to be concerned about data retention!

Data Retention

Flash memory cells can lose their charge over time. Manufacturers most prominently rate their SSDs for a certain number of write/erase cycles, but many also specify a data-retention period, which is the amount of time the drive can be left unpowered before data loss becomes a concern. Typical SSDs and SD cards quote a data-retention period of 5–10 years, although various factors (e.g., temperature) can affect this. If you have childhood photos on an old SD card somewhere, it's best to back them up!

A Simple Solution?

Hay! I just had an idea! What if we want to erase just one page? Could we copy all the data we want to keep into RAM, erase the whole block, and then write back just the pages we wanted to keep?

Let's try it out! Click the button below to enable and select this feature in the simulator.

Go Back to Simulator

Now, when you try to erase a single page,

The simulator will first copy all the other written pages from that block into RAM (note that data in the computer's RAM is not on the SSD and so is not shown in the simulator)
Then it erases the entire block
Finally, it writes back all the saved pages (except the one you wanted to erase)

Scroll back and try writing to several pages in a block, then erasing just one of them. Watch how the process unfolds…
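The three steps can be sketched in code as follows. This is a minimal illustration under our own assumptions (a block represented as a Python list, with None marking an erased page); the function name is hypothetical and is not taken from the simulator.

```python
def erase_single_page(block_pages, victim):
    """Erase one page of a block by saving and rewriting the rest.

    block_pages: list where None marks an erased page.
    Returns (new_pages, erases_used).
    """
    # 1. Copy every other written page into RAM.
    saved = {i: d for i, d in enumerate(block_pages)
             if d is not None and i != victim}
    # 2. Erase the whole block (the only erase that flash offers).
    new_pages = [None] * len(block_pages)
    # 3. Write the saved pages back; the victim stays erased.
    for i, d in saved.items():
        new_pages[i] = d
    return new_pages, 1

pages = ["a", "b", "c", None]
pages, erases = erase_single_page(pages, victim=1)
print(pages, erases)  # ['a', None, 'c', None] 1
```

Between steps 2 and 3, the only copy of the surviving data is the one in RAM, so a power failure at that moment would lose it.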

Whoa! I didn't realize that erasing a single page could be so complicated! It's like you have to do a whole dance just to erase one page. And if the power goes out in the middle, you could lose all your data!
And with every erase, we're wearing out the block. Once it's worn out, you're stuck—that part of the SSD is dead. How is that even workable?
Wait—didn't we used to do something like this with disks? When we talked about LBA, we mentioned that disks could remap bad sectors with spare ones. Could we do something like that for bad blocks?

Overprovisioning and Block Remapping

Let's add a new feature: spare blocks we can remap. The idea of keeping some blocks spare is called overprovisioning. When a block wears out, we can just remap it to a spare block and keep going.

Click the button below to enable and select this feature in the simulator, where we've reserved the last two blocks as spares (reducing our user-visible space from eight to six blocks).

Go Back to Simulator

Experiment with writing and erasing pages, and see how the simulator handles block wear and remapping. Try to wear out a block and see what happens. Also notice that the tooltip now shows the user-visible block number and the physical block number (smaller, in parentheses after the user-visible number).
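Here is one way the remapping could be modeled. It's a sketch under stated assumptions rather than the simulator's actual code: the class and attribute names are ours, the eight blocks with two spares match the setup described above, and the eight-erase wear limit matches the simulator's artificially low value.

```python
class RemappingSSD:
    """Toy model: logical blocks mapped to physical blocks, with spares."""

    def __init__(self, physical_blocks=8, spares=2, max_erases=8):
        visible = physical_blocks - spares
        self.mapping = list(range(visible))              # logical -> physical
        self.spares = list(range(visible, physical_blocks))
        self.erase_counts = [0] * physical_blocks
        self.max_erases = max_erases

    def erase(self, logical):
        physical = self.mapping[logical]
        self.erase_counts[physical] += 1
        if self.erase_counts[physical] >= self.max_erases:
            # The block just wore out: retire it and switch to a spare.
            if not self.spares:
                raise RuntimeError("no spare blocks left")
            self.mapping[logical] = self.spares.pop(0)


ssd = RemappingSSD()
for _ in range(8):
    ssd.erase(0)
print(ssd.mapping[0])  # 6: logical block 0 now lives on the first spare
```

The user still sees six blocks numbered 0 to 5; only the mapping table knows that block 0 has moved.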

I see! When a block wears out, our SSD just moves the data to a spare block and keeps going. It's like having a backup plan for when things go wrong. But what happens when we run out of spare blocks?
Yeah—we only had two spare blocks, so if we wear out more than two blocks, we're in trouble. We can't keep remapping forever.
Hay! What if we don't wait until a block wears out and remap sooner? If we move the data around, we can spread out the wear and make the SSD last longer.

Block Remap on Erase for Simple Wear Leveling

Because we can already remap blocks, we have a mechanism that we can use to reduce wear. We can erase a block and then juggle the mapping around to point at a block with less wear instead of just reusing the erased block. This approach allows us to spread wear damage across all the blocks more evenly and make the SSD last longer. It also avoids the risk and overhead of copying all the data to RAM and back, since we only erase the source block when data has been successfully copied to a new block and the mapping has been updated.

Note that the physical location of data on the SSD now quickly diverges from its logical location. So it makes sense to give the part of the system that performs this remapping a name: the Flash Translation Layer (FTL). It maps user-centric LBAs to the physical blocks and pages on the SSD.
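A block-level FTL that levels wear on erase might look something like the sketch below. The function and parameter names are hypothetical, and we assume there is always at least one free (erased, unmapped) block to move to; choosing the least-worn free block is one simple policy, not necessarily the one the simulator uses.

```python
def remap_on_erase(mapping, free_blocks, erase_counts, logical):
    """Erase a logical block by moving it to the least-worn free block.

    mapping: dict of logical -> physical block numbers.
    free_blocks: set of erased, unmapped physical blocks (must not be empty).
    erase_counts: list of erase counts indexed by physical block.
    """
    old = mapping[logical]
    # Choose the least-worn free block (ties broken by block number).
    new = min(free_blocks, key=lambda b: (erase_counts[b], b))
    free_blocks.remove(new)
    # (Pages worth keeping would be copied from old to new here.)
    mapping[logical] = new
    # Only now erase the old block and return it to the free pool.
    erase_counts[old] += 1
    free_blocks.add(old)
    return new


mapping = {0: 0, 1: 1}
free_blocks = {2, 3}
erase_counts = [0, 0, 0, 0]
for _ in range(3):
    remap_on_erase(mapping, free_blocks, erase_counts, logical=0)
print(erase_counts)  # [1, 0, 1, 1]
```

Three erases of the same logical block cost three different physical blocks one cycle each, instead of costing one block three cycles.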

When you click the button below to enable support for the Flash Translation Layer, it will enable two new features in the simulator:

Auto erase: Now, when you try to write a page, if it has already been written to, the SSD will automatically erase it first.
LBA entry: Rather than laboriously clicking on each page to write, you can now enter a range of LBAs to write to. For example, entering 5-10, 3, 15 will write to LBAs 5, 6, 7, 8, 9, 10, 3, and 15. (The simulator will even show you a progress bar as it works through your list of LBAs.)
- Slightly confusingly, the term “LBA” means “logical block address” in the same sense we used it with spinning disks, where it referred to the logical sector number. Here, it's really a logical page number. We'll stick with the common (mis)usage, but we'll always just say “LBA” rather than expanding it to its (confusing) full name.
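For the curious, expanding an LBA list like the one in the example is straightforward. The parser below is our own guess at how such input could be handled (the function name is hypothetical, and the simulator's real parser may accept other forms or reject input differently):

```python
def parse_lba_spec(spec):
    """Expand a spec such as '5-10, 3, 15' into a list of LBAs, in order."""
    lbas = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue                      # tolerate stray commas
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            lbas.extend(range(start, end + 1))   # ranges are inclusive
        else:
            lbas.append(int(part))
    return lbas

print(parse_lba_spec("5-10, 3, 15"))  # [5, 6, 7, 8, 9, 10, 3, 15]
```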

Click the button below and scroll up and play with the simulator. But before you try a long sequence of LBA requests, first try writing and erasing a few pages to get a feel for what is going on. Remember, the mouse tooltips show you the logical and physical block numbers.

Go Back to Simulator

I think it's getting better! The wear is more spread out now, so the SSD should last longer. But it's still not perfect. When I write over a range of pages, it's really wasteful. The SSD is doing a lot of extra work that it doesn't need to do.
What do you mean?
Well, when I write to a range of pages, the SSD doesn't know that I'm going to erase them all soon. So it has to carefully preserve them, even though they're all going to be erased soon. It's like it's doing a lot of extra work for nothing.
If only we had a way to say, “I don't care about this page anymore, I don't need it for anything right now, but don't bother copying it, it doesn't need to be kept anymore.” That would save a lot of work.
That would be a good feature! We'll get to that…
Meh. I just keep making writes to the blocks that are the most worn out. I look for the ones that are the most brown and target my writes there. That way I'm done wearing it out sooner.

In the simulator we can see which blocks are the most worn and target them, but a real SSD doesn't supply that information. It turns out, though, that remapping blocks according to a predictable pattern is almost as risky, because an adversary could predict which blocks will be most worn out and target them for writes (or, in a less hostile world, we could just have an access pattern that coincidentally aligns with the remapping pattern). So in future, we'll adopt a random remapping strategy to avoid this issue. You'll still be able to see which blocks are most worn in the simulator, but only the drive itself would know in the real world.

Now, let's return to the idea of saying, “I don't care about this page anymore”….

Trimming

Telling the SSD that you no longer care about a page is called trimming. When you trim a page, the SSD knows it contains garbage data and can skip copying it when it remaps the block, which saves a lot of work and wear on the SSD.
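The saving is easy to see if we track a state for each page. In this sketch the three state names and both function names are our own invention for illustration; a real drive keeps this bookkeeping inside its FTL.

```python
LIVE, GARBAGE, ERASED = "live", "garbage", "erased"

def pages_to_copy(block_states):
    """Indexes of pages that must be preserved when a block is remapped."""
    return [i for i, state in enumerate(block_states) if state == LIVE]

def trim(block_states, page):
    """Mark a written page as garbage so later remaps can skip it."""
    if block_states[page] == LIVE:
        block_states[page] = GARBAGE

block = [LIVE, LIVE, LIVE, ERASED]
before = len(pages_to_copy(block))   # 3 pages would be copied
trim(block, 0)
trim(block, 1)
after = len(pages_to_copy(block))    # only 1 page is copied now
print(before, after)  # 3 1
```

Every page that doesn't need copying is one fewer page write, and therefore a little less wear.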

Just in time for Thanksgiving, SSDs with all the trimmings!
When SSDs first started appearing, people would replace spinning disks with SSDs for the speed advantage, but operating systems didn't know about trimming. So the SSDs would do a lot of extra work and wear out more quickly than they should have. But these days all major operating systems support trimming.
I'd heard of trim support, but I had no idea what it was for. Now I know it's just telling the SSD, “I don't need this data anymore, you can forget about it.”

When you click the button below to enable both trimming and random block remapping, you'll see the SSD start to behave more like a real SSD. The simulator will now

Use trim: When it has a range of LBAs (pages) to write, it will first trim all the LBAs that are going to be written, so the SSD knows they're garbage and doesn't have to copy them if it needs to remap the block.
Garbage pages: When a page is marked as garbage, it will be shown in a dark blue color.
Random remapping: Instead of remapping blocks in a predictable pattern, the simulator will now remap them randomly, making it much harder for an adversary to predict the new location of the worn blocks.
Runs faster: As we have to do more work to wear out the disk, we've sped up the simulator a bit.

Experiment with the simulator and see how trimming and random remapping change the behavior of the SSD. Try writing to a range of pages and see how the SSD behaves.

Go Back to Simulator

I really think we're getting there. It's pretty hard to wear it out now.
Hay! I noticed that if I keep writing to the same page in a block, it still has to erase the whole block to clear that page. That seems like it's still a bit wasteful.
Could we make our flash translation layer a bit smarter and not just remap blocks, but also remap pages within a block?
Your wish is our command! We'll get to that next.
Oh no! My head is starting to spin more than it did when we covered disks! Is it going to end up being a total free-for-all, where any page can be anywhere?
Some SSDs do exactly that, but we'll keep things simple, and only remap pages within a block, not between blocks.

Page-Level Remapping

In our page-level remapping scheme, when we want to erase a page to write new data, we'll mark the existing physical page as garbage and adjust the page mapping to move one of the other, not-yet-used, pages in to take its place.
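As a rough sketch of this idea, the toy class below keeps a logical-to-physical page map for a single block. Everything here is an illustrative assumption (the names, the four pages per block, and the policy of erasing and compacting only when no unused page is left); it is not the simulator's implementation.

```python
class PageMappedBlock:
    """Toy block with a logical-to-physical page map (names illustrative)."""

    def __init__(self, pages_per_block=4):
        self.state = ["erased"] * pages_per_block   # per physical page
        self.page_map = {}                          # logical -> physical
        self.erase_count = 0

    def write(self, logical):
        if not 0 <= logical < len(self.state):
            raise IndexError("logical page out of range")
        if logical in self.page_map:
            # Overwrite: retire the old physical page instead of erasing.
            self.state[self.page_map.pop(logical)] = "garbage"
        if "erased" not in self.state:
            self._erase_and_compact()
        physical = self.state.index("erased")
        self.state[physical] = "live"
        self.page_map[logical] = physical

    def _erase_and_compact(self):
        # No unused page left: erase the block and rewrite the live pages.
        live = sorted(self.page_map)
        self.state = ["erased"] * len(self.state)
        self.page_map = {}
        self.erase_count += 1
        for i, logical in enumerate(live):
            self.state[i] = "live"
            self.page_map[logical] = i


block = PageMappedBlock()
for _ in range(5):
    block.write(0)            # rewrite the same logical page five times
print(block.erase_count)      # 1
```

Without page remapping, the four overwrites of page 0 would each have needed an erase; here the block is erased only once, when it runs out of unused pages.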

Click the button below to enable page-level remapping in the simulator. We've also sped it up just a little more when running LBA request sequences. It's probably a good idea to first click on blocks with the mouse to write and erase a few pages to get a feel for what is going on as manual changes run at the normal speed. Remember that you can mouse over the blocks to see the logical and physical block numbers, and now you can see similar information for pages as well.

Go Back to Simulator

I figured out a way to trigger more erases. If we completely fill a block and then keep writing to the same page in the block, the SSD has to erase the whole block to clear that page, because there aren't any spare pages to remap to.
Gah. It's like we're back to where we started!
I guess we need completely arbitrary page remapping to fix that.
We could, but we'll go with a hybrid scheme.

Hybrid Remapping

Real-world SSDs use a variety of schemes in their flash translation layers to balance wear leveling, performance, and complexity. Many of these schemes are proprietary and they continue to be a matter of research. Some do use completely arbitrary page remapping, but that approach is more complex and can be slower and more demanding for space to store the mapping. We'll adopt a simple hybrid scheme.

Fun fact, Prof. Melissa came up with this approach off the top of her head, rather than looking up a real-world scheme. It was fun for her, and it's realistic enough for our lesson, but don't go mentioning it in an interview at a flash-memory company! You might get blank stares.

The approach we'll use will only affect requests made in the LBA entry mode, because we'll do the remapping at the LBA-to-block/page level.

LBA Permutation

We'll pass each LBA address through a simple permutation function that will map it to a new LBA address. This remapping will spread the wear across the blocks and pages. The permutation is random, so it will be impossible to know where any given LBA will end up.
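One simple way to build such a permutation is to shuffle the list of all LBAs once and use it as a lookup table. The sketch below does that with Python's standard random module; the function names, the 32-page drive, the four pages per block, and the fixed seed are all assumptions for illustration (a real drive would keep its permutation to itself).

```python
import random

def make_lba_permutation(total_pages, seed):
    """Return a list p where p[lba] is the permuted LBA."""
    permuted = list(range(total_pages))
    random.Random(seed).shuffle(permuted)
    return permuted

def block_and_page(lba, pages_per_block):
    """Split a (permuted) LBA into a block number and a page within it."""
    return divmod(lba, pages_per_block)

perm = make_lba_permutation(total_pages=32, seed=105)
# Writing LBAs 0-15 (half the drive) now lands on scattered blocks
# instead of filling the first four blocks completely.
touched = [block_and_page(perm[lba], 4)[0] for lba in range(16)]
print(sorted(set(touched)))
```

Because the table is a permutation, every LBA still maps to exactly one page and no two LBAs collide; only the placement changes. A random shuffle spreads pages out on average rather than guaranteeing a perfectly even split.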

So if we write half the LBAs on the disk, no matter which half or where, we'll be pretty much guaranteed that all our flash blocks will be half full of pages? That's pretty cool.

Click the button below to enable the hybrid-remapping scheme in the simulator. You can't just click on a cell to write to it now (that would be cheating!)—you have to use the LBA entry mode. Try writing to a range of pages and see how the SSD behaves.

Go Back to Simulator

I think we're there! No matter what I try, I can't seem to wear out the SSD. It's like it's always spreading out the wear.
I think I can see a way. If the disk is almost totally full, there just won't be a lot of options for avoiding block erases. But as long as we keep some space free, we should be good.

We can mitigate the completely-full block issue somewhat by having some number of spare pages per block, which is a page-level version of the overprovisioning we had at the block level. We'll add this feature to the simulator now.

But perhaps the moral is, don't fill your SSD to the brim!

Meh. Are we done yet? Never mind the SSDs, I'm getting worn out! And I'm not even sure why I should care. What's this got to do with the operating system? It's all just stuff the SSD manufacturer has to worry about.
We're nearly done. Let's review the trade-offs we've seen in SSD design and connect it back to the operating system.

Trade-offs in SSD Design

Our journey through increasingly sophisticated SSD management has revealed several key trade-offs:

Space vs. Reliability

We've seen how allocating space for overprovisioning can improve reliability, whether we do it at the block or page level. But overprovisioning comes at the cost of reduced user-visible capacity.

Enterprise SSDs often reserve 20-30% of their capacity for overprovisioning, as well as error-correction codes and other reliability features. Consumer SSDs typically have less overprovisioning, but still enough to ensure good reliability, whereas a USB stick might have very little because it's designed for occasional use and to meet a low price point.

The extent to which a single page write causes more than a single page's worth of wear is known as write amplification. The more overprovisioning we have, the less write amplification we'll see.
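Write amplification is commonly expressed as a ratio: the amount actually written to flash divided by the amount the host asked to write. A minimal sketch (the function name and the example numbers are ours):

```python
def write_amplification(flash_pages_written, host_pages_written):
    """Ratio of pages physically written to pages the host requested."""
    return flash_pages_written / host_pages_written

# Example: overwriting 1 page in a block that holds 3 other live pages
# forces those 3 to be rewritten too: 4 flash writes for 1 host write.
waf = write_amplification(flash_pages_written=4, host_pages_written=1)
print(waf)  # 4.0
```

A ratio of 1.0 would mean no extra wear at all; spare space and trimming both help push the ratio down toward that ideal.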

What about SD cards? They're basically another kind of SSD, right?
SD cards are an interesting case. In some applications, like digital cameras, they have large files written to them once, and then they're read many times. But SD cards are also used as the “hard drive” for computers like the Raspberry Pi, where they're written to frequently. And in applications like dash cams, they're written to constantly. So the marketplace for SD cards tends to be much more explicit about their performance and endurance characteristics. A high-end SD card will have more overprovisioning and better wear leveling than a cheap one, but still not as much as a high-end SSD.

Cheap SD cards with low write endurance probably use QLC flash, while high-endurance cards are likely to use MLC or TLC. Because the number of bits per flash cell drops, so too does the transfer speed as less data is stored in each cell.

FTL Design Complexity

In this lesson, we've seen increasingly sophisticated FTL designs to manage wear leveling, remapping, and garbage management. More complex FTLs can provide better performance and endurance, but they also require more resources and can be harder to design and test.

To meet price constraints, something like a USB stick might use a very simple FTL, whereas a high-end enterprise SSD might have a very complex FTL to ensure consistent performance and reliability under heavy workloads. Consumer SSDs fall somewhere in between.

Advanced features include

DRAM caching of mapping tables
Redundant storage of mapping tables on the drive itself
Multiple block types
Age tracking so that old data that is unlikely to be rewritten is grouped together in full blocks
Migrating old data on lightly worn blocks to blocks nearing the end of their useful lifespan to obtain nearly fresh blocks for new data

Heavy users of large numbers of SSDs, such as large cloud providers, may find it cost-effective to write their own custom FTLs optimized for their specific workloads.

So there are really people at, say, Microsoft whose job it is to try to squeeze the best performance out of their SSDs?
Absolutely! They're called storage engineers, and they're responsible for making sure that the storage systems that underpin services like Azure are reliable, fast, and cost-effective. They'll work with the SSD manufacturers to understand the characteristics of the drives they're using and then design their own software to manage them.

Real-World Endurance

While our simulator showed blocks wearing out after just eight erases to make it easier to explore these concepts, real SSDs are much more durable.

Modern SSDs can handle thousands to hundreds of thousands of write/erase cycles per block
They have more sophisticated wear leveling than our simple schemes
They have much more space for overprovisioning
Even consumer QLC drives (with ~1,000 cycles per block) can last for years of normal use
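A back-of-the-envelope estimate shows why even ~1,000 cycles goes a long way. The sketch below assumes perfect wear leveling, and the 1 TB capacity, 50 GB of writes per day, and write amplification of 2 are made-up example figures, not measurements of any real drive.

```python
def estimated_lifetime_years(capacity_bytes, erase_cycles,
                             host_bytes_per_day, write_amplification=1.0):
    """Rough lifetime estimate, assuming wear is spread perfectly evenly."""
    total_host_bytes = capacity_bytes * erase_cycles / write_amplification
    return total_host_bytes / host_bytes_per_day / 365

# Hypothetical 1 TB QLC drive, 1,000 cycles, 50 GB written per day,
# and a write amplification of 2.
years = estimated_lifetime_years(1e12, 1000, 50e9, 2.0)
print(round(years, 1))  # 27.4
```

Real lifetimes will be shorter than this idealized figure, but the estimate leaves plenty of headroom for typical use.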

(sighs with relief) So I don't need to worry too much about wearing out my SSD?
Not if you have enough free space for all these tricks to work!
And remember, backing up your important data is always a good idea, regardless of the storage technology!

Operating-System Support

Modern operating systems support SSDs in several ways:

Use trim when data becomes garbage: Trim support allows the OS to tell the SSD that a page is no longer needed, improving performance and wear leveling
Use native command queuing: NCQ lets the OS hand the SSD several read and write commands at once, so the drive can service them in whatever order works best. The more the SSD's controller knows about what's coming, the better it can optimize its work
Respect alignment: Ensuring that file blocks are aligned with the SSD's page size can improve performance and endurance. For some SSDs with simple FTLs, aligning with the block size can be important, too.
Avoid update-in-place: Where possible, it's usually better to add data to the end of a file rather than overwriting it in the middle. This approach can reduce write amplification and improve performance, especially for simple FTLs.
Avoid swapping to SSD: Swapping memory pages to SSD can cause a lot of writes, so it's best to avoid it if possible as every write comes with a cost in wear. Apple's iOS, for example, doesn't use swap on its SSDs.

Although following these best practices can help improve SSD performance and endurance, modern SSDs are quite resilient and can handle a lot of abuse.

It's always better to treat things well, but it's good to know that SSDs are pretty tough.
Meh. Are we done? I want to level out of here!
We're done.
You've seen how this complex dance of mapping, remapping, and wear leveling is why SSDs are sometimes called “a computer pretending to be a disk.” They're constantly working behind the scenes to maintain the illusion of simple, reliable storage while managing the quirky physics of flash memory. Quite remarkable, really!


CS 105
