Out-of-Bounds Access
Indexing Past the End of Arrays
What happens if you try to use an index that's bigger than the size of your array?
What do you think happens?
Well, in languages like Python and Java, you'd get an error…
True! But remember C++'s rallying cries: Power to the People! and Zero Overhead!
I've got a bad feeling about this…
Real-World Undefined Behavior
To understand what happens when you access an array out of bounds, we need to talk about undefined behavior, but to get a sense of what that means, let's start with a real-world example.
Suppose you're driving a car, and you've been given the following map, with strict instructions to “stay on the road”:
You're driving along, and you see a shortcut that looks like it will save you some time. You're tempted to take it, but then you remember the instructions you were given: “stay on the road”. So you keep driving and arrive safely at your destination. But what would have happened if you'd tried to take that shortcut? You don't know! No one has said what will happen, so the consequences are not defined.
Languages like Python or Java like to define what happens when your program breaks the rules. So, in this situation they might say “we have installed guardrails on the road so if you try to drive off the road, you'll just bounce back onto the road”. Or they might have installed guardrails such that if your car touches the rails it will shut itself off and you'll need to be towed.
But installing guardrails is time consuming and expensive. Most roads don't have guardrails—you're just expected to drive properly and stay on the road. If you don't, you'll face the consequences, whatever they are. That's what it means to have undefined behavior. And that's the approach taken by C++—it expects you to follow the rules, and if you don't, it doesn't promise to protect you from the consequences, or even say what the consequences will be.
Perhaps the reason the road curves is that there is a lake in the way…
Or perhaps there's a cliff… But even if there was a dirt road through private property, perhaps a farm, and you drove through it fine three times, the fourth time you might get shot at by the farmer, or hit a closed gate, or get stuck in the mud.
Breaking the rules and getting away with it once doesn't mean you'll get away with it again. And that's what it means to have undefined behavior.
Breaking Array-Access Rules
In Java and Python, if you try to access an array at an index that doesn't exist, there is a predictable outcome (e.g., in Java, an ArrayIndexOutOfBoundsException will be thrown). For that predictable outcome to happen, there has to be some code (that you didn't write!) that double-checks whether the index is in bounds and then either gets the item or throws the exception accordingly.
That little bit of time spent checking the bounds of the array is overhead!
It's an extra cost that the programmer didn't ask for and probably doesn't need most of the time.
And you know by now how overhead is treated in C++'s design.
So how does C++ avoid checking the bounds of the array?
C++'s attitude is that it's not going to babysit you and make sure you're doing everything right. You have rules to follow, like “stay inside the bounds of the array”, and if you break them, who knows what the consequences will be, so don't break the rules.
The C++ standard simply says “all array accesses must be in bounds”. If you break that rule, the consequences are not defined.
Most implementations of C++ make no special effort to check whether you're breaking the rules because that would take valuable time. They just do whatever is easiest and fastest, under the assumption that your program is following the rules. And, like driving off the road, if you break the rules, you might get away with it sometimes, but other times subtle damage will be done that you won't notice right away (e.g., some damage to the underside of your car, or subtle data corruption), and sometimes something catastrophic will happen immediately (e.g., your car catches fire, or your program scribbles over important data or crashes).
The practical upshot is that if C++ says “The consequences of accessing outside the bounds of an array are undefined” what it's really saying is “There is a hard rule: you must never access outside the bounds of an array.” If you break the rule, the consequences are allowed to be arbitrarily bad. You should certainly never write code that knowingly violates C++'s rules on the basis that the undefined behavior seems to be benign because “nothing bad seemed to happen” when you tried it a few times.
Consider this code snippet:
int x = 444;
int a[3]{1, 2, 3};
int y = 555;
cout << a[3] << endl;
Meh. Can it really be the case that anything could happen?
Yes! Let's demonstrate.
Wow, a glimpse into the world of code exploits is a bit scary!
Code exploits, where attackers take advantage of both program bugs and C++'s lack of guardrails, show the perils of undefined behavior. Arguably, they show that for programs that deal with input from untrusted sources, we either need to be especially careful to make sure our programs color inside the lines (e.g., via careful input validation), or we need to choose a language that makes safer choices by default.
Just because most C++ systems have historically optimized for speed over safety doesn't mean they can't also include safety features. There are lots of programs that will never have hackers trying to exploit them, and those programs will run really fast!
Typical Behavior
So if I write a program for this class and it has a bug, it might erase everything on my hard drive?!
No one deliberately writes a compiler that says, “If the programmer makes a silly mistake, erase their hard drive!”.
So for seriously bad things to happen, you need some kind of chain of cause and effect that isn't likely to arise in the kinds of code you write for this class.
Although “undefined behavior” means that any behavior could technically occur, we can usually make educated guesses about what might happen on a typical system—usually by thinking about what would plausibly happen if there were no guardrails at runtime to actively prevent rule breaking by your program.
Although you shouldn't rely on any particular outcome, knowing the likely outcomes of indexing errors could help you diagnose those errors in your own code.
One reason x might end up being 444 rather than 555 is that on this machine, the stack grows downwards from high memory addresses to low ones.
Ah. Makes sense. In CS 70's model, we'll stick to counting upwards though!
And, as we said, on a real machine, there's no guarantee that the compiler lays out a function's variables in any particular order. (Even if it did on one machine, the same program compiled on a machine with a different architecture might use a very different layout.)
Key Ideas:
- Remember that a[3] means *(a + 3), the contents of the memory address where the fourth element would be, if it existed.
- So the least-effort outcome is to interpret whatever bits are at that address as an int and return that value!
- Remember that variables in memory on a real machine need not be arranged in the same order as our memory model!
Segmentation Faults
So… does this mean that I can just access memory outside of my array?
Yeah, usually! The standard doesn't guarantee you can, but in practice, that's the typical behavior.
But… What if I do something horrible?? What if I mess something up for another program? Or the operating system???
Ah! Some good news: you generally don't need to worry about that issue. On a modern computer operating system like Windows, macOS, or Linux, each program's access is limited to its own assigned region of memory.
If your program tries to access memory outside of its allocated region, the operating system will immediately shut the program down. This error is called a segmentation fault, or “segfault” for short.
A common cause for segmentation faults is accessing an array wayyyyyy out of bounds.
Segmentation fault errors can be very cryptic. Your program just stops (“crashes”) and you get a message that basically just says “Segmentation fault”—no line number, no context, nothing. We'll eventually introduce you to a tool that can help give you more information for debugging these errors.
At the very least you now know that if you get a segmentation fault, you should be suspicious of your array indices!
Here's a slightly different snippet:
int x = 444;
int a[3]{1, 2, 3};
int y = 555;
cout << a[1000000000] << endl;
Summary
- Accessing an array out of bounds is “undefined behavior”.
- Literally no outcomes are ruled out, as far as the C++ standard is concerned.
- You should never write a program that relies on a particular outcome when the behavior is undefined.
- Typical C++ compilers produce code that does things the fastest and easiest way, without any special safety checks while the program is running.
- Thus we shouldn't be surprised if the program finds the place in memory where the item would have been if the array had been that big, and interprets whatever bits happen to be there as the type it's expecting.
- Accessing an array out of bounds can cause strange behavior because the compiler assumes that you're following the rules.
- On our systems, accessing memory that is way out of bounds can cause your program to crash.
- A modern operating system will not allow your program to access memory outside of its allotted region.
- This error is called a “segmentation fault” (“segfault” for short).