Resource Identifiers

User programs need to deal with a variety of “things” outside of themselves, including

Files
Devices (such as the console)
Other processes
Network connections
Users (and groups of users)
Access rights

In particular, a program may wish to say things like

“Write this string to this file I opened earlier.”
“Read the next character from the terminal.”
“Wait for this other process to finish.”
“Check whether my userid is in the administrative-access group.”

Each of these actions involves some kind of resource that the operating system manages. The question is: How does a program refer to these resources?

In the Past: “Operating System” as a Library

In the earliest days of computing, the operating system was just a set of subroutines linked into a program (and only one program ran at a time, with full control of the machine). In this model, the program could refer to resources directly.

Thus, we might have a struct openfile that represents an open file, and a program might refer to a file by passing around a pointer to an openfile structure. This style corresponds to how we'd code anything else—using pointers to objects to refer to them. But when we get to the point where multiple programs are running at once, with some level of isolation between them, this approach breaks down—we need another way to refer to resources.

Hay! I wrote C++ code in CS 70, and we had iostream objects that represented files. So surely we do still use pointers to objects to refer to resources, right?
And actually, the C standard library is FILE* objects, which are pointers to file objects. So yeah, I think Horse is right, we still use pointers to objects to refer to resources.
Well, here's the thing. Although those might be pointers to objects, they're not pointers to objects that are directly managed by the operating system. They're managed by the C (and C++) standard libraries, and are, in fact, wrappers around the actual operating system resources. The operating system itself doesn't know about FILE* objects.
Okay, but why won't that work? Why can't we just have pointers to objects that are managed by the operating system?

Suppose the operating system provided this interface:

struct openfile* open(const char* filename, int flags);
int write(struct openfile* file, const void* buf, size_t len);
int read(struct openfile* file, void* buf, size_t len);
int close(struct openfile* file);

The first thing to wonder about is where is this pointer pointing? Does the pointer point to somewhere inside memory that belongs to the kernel itself, or is it pointing to some place in memory that belongs to the user program?

Think about both options, and why each of them is probably a bad idea. Hint: Think about the operating system's need to behave correctly and securely.

What if a user calls read or write where the file argument just points to some random memory? That seems bad.
That's right! If the operating system just took a pointer from the user and assumed it pointed to a valid struct openfile, the OS might end up with junk data that could cause a crash. And if the pointer pointed to some piece of kernel memory, that could be even worse, as that could cause subtle corruption of the entire system.
Can't we just check the pointer points to a valid struct openfile before we use it?
How? Could you check whether pointers were valid in the code you wrote for CS 70?
Uh… no. I guess not.
Well, it's not impossible, but it would be a lot of work. You'd essentially need a list of all the valid pointers to check against, and if you do that, you've lost the directness and efficiency that pointers give you.
What if it was a pointer to data in user memory?
We'd still have the possibility that the pointer could point to some data that isn't a valid struct openfile, but now we also need to worry about the user modifying that data, which could create problems that result in the operating system behaving incorrectly.

For example, a program might open an important file in read-only mode (which is allowed), but then change the struct openfile to write mode (which was not supposed to be allowed). That change could allow the program to write to a file that it wasn't supposed to be able to write to.

User Space vs. Kernel Space

The operating system is responsible for managing the resources of the computer, and it needs to do so in a way that is secure and efficient. One of the key principles of operating system design is the separation of user space and kernel space.

graph TD subgraph US[User Space] A[User Applications] B[Libraries] end subgraph KS[Kernel Space] C[System Calls] D[Device Drivers] E[Memory Management] F[Process Management] end US -.->|System Call Interface| KS classDef userSpace fill:#f9f,stroke:#333,stroke-width:2px; classDef kernelSpace fill:#bbf,stroke:#333,stroke-width:2px; class US userSpace; class KS kernelSpace;

In essence, there is a wall between user space and kernel space. User programs run in user space, and cannot directly access anything in kernel space. In contrast, the operating system runs in kernel space, and has full access to the entire computer, but it cannot necessarily trust any information it receives from user space.

Thus, we need to be very careful about how we allow data to cross the boundary between user space and kernel space. The operating system needs to ensure that the data it receives from user programs is valid and safe to use. As we've seen, we can't safely store struct openfile objects in user space, or trust that struct openfile pointers are valid. We need a different way to refer to resources.

So it can't be a pointer. But it has to be something. And ideally it should be something that's easy.
How about a number? Numbers are easy.
Yeah, it could just be an array index or something. That's easy to check and easy to use.
Let's see how that approach might work.

Numeric Identifiers

One straightforward way to refer to resources is to use a number. The most obvious number to use is an index into an array of resource structures. For example, we might have an array of struct process objects, and refer to a process by its index in the array.

Index	struct process*
[0]	(null)
[1]	0x87654321
[2]	0x12345678
[3]	0xabcdef01
[4]	(null)
[5]	0xfedcba98
...	...

So user programs will just say, “I want to know if process 3 is running”, or something like that?
That's right. The operating system can then look up the struct process object for process 3 by using the index 3 to find the right entry in the array.

So we can just check that the index is in bounds and also that the value in the array element at that index is not null?
That seems so much easier than trying to figure out a way to check if a pointer is valid, which I'm not even sure is possible.
That's right.
I also think it's more human-readable. It's easier to talk about “process 3” than “the process at memory address 0xabcdef01”.
Yes. In fact, if we're a user program, we can't even get the memory address of the struct process object. We can only refer to it by its index in the array.

Tagged Indexes

In some systems, we want to be even more sure that the index is valid. For example, we might want to ensure that the index refers to a process that the user program is allowed to access. In this case, we might use a tagged index.

A tagged index is a pair of a number and a tag (sometime called a uniqifier). The tag is a value that is associated with the resource, and it is used to ensure that the index is valid. So, for example, our process table could instead look like

Index	Tag	struct process*
[0]	0	(null)
[1]	42	0x87654321
[2]	17	0x12345678
[3]	54	0xabcdef01
[4]	0	(null)
[5]	93	0xfedcba98
...	...	...

Now the resource identifier is a pair of a number and a tag, but those identifiers can usually be combined into a single number. For example, we might use the formula tag << 16 | index to combine the tag and index into a single number.

To ensure that we have unique tags, we might use consecutive numbers (sometimes called generation numbers) or we might use completely random numbers. Those approaches guarantee that will have a different tag when reusing a slot in the table, which makes it harder for a program to accidentally refer to the wrong resource.

Hay! What does the << operator do there?
We're saying, “shift the bits of tag left by 16 places, and then use bitwise or (|) to combine that with index”. So if tag is 42 and index is 3, then 42 << 16 | 3 is 0x2A0003, or 2752515 in decimal.
So that's more secure?
Well, it's not exactly secure in the strong sense, because a program could keep trying different tags until it found one that worked. But it can catch mistakes where a program tries to use an index that it shouldn't.
Meh. But now it's not as human readable.
That's true. But, in practice, we don't usually need to worry about users reading the resource identifiers, so it's not a big deal.

Hash Tables

A very similar approach is to use a hash table with the resource identifier as the key. If we keep the key as an number, a hash table with linear probing is remarkably similar to our tagged-index approach

So which of these approaches do real operating systems use?
It depends.

Unix-like systems often feel like they're always using plain indexes, because they often produce relatively small numbers, and some resources, like file descriptors, clearly are just plain indexes. We'll revisit Unix's choices when we talk about process IDs.

Windows, on the other hand, has a HANDLE type that is explicitly a tagged index for various resources.

Fixed-Size Tables for Numeric Identifiers

I remember from CS 70 that primitive arrays are fixed-size. So does that mean that the process table needs to be fixed-size?
Hay! Wait a minute—in CS 70 we also saw that we could make our own array-like data structures that could grow and shrink. So why can't we do that here?
Meh. I don't wanna do that. Give users a limit and if they hit it, they hit it. That way I can go home early.

OS designers have a choice to make. If they impose fixed limits on parameters such as the number of possible processes, number of possible open files, and so on, they can use fixed-size tables, which are simpler to implement and also typically just a little bit faster than dynamic data structures.

In the early days of Bell Labs's Unix and, later, in early versions of Linux, fixed-size tables were used. Linux set the limit on the maximum number of open files per process at 1024, and the maximum number of processes at 4096. Today, Linux uses dynamic data structures for these tables, but there are still maximum limits (though typically set much higher, and usually changeable by the system administrator for specific use cases).

What (Else) Do We Call These Numeric Identifiers?

Although we've used the term “numeric identifiers” here, various terms are used in practice, including

Handle: Windows uses the term “handle” for (tagged) numeric identifiers.
Descriptor: Unix-like systems often use the term “descriptor” for resource identifiers (e.g., file descriptors).
Identifier: The term “identifier” is also used, as it's more general (e.g., PID means “process identifier”).

Despite the plethora of terms, the basic idea is the same: a number that refers to a resource.

Other Options

There are also other ways to refer to resources, including

Hierarchical Numbering — IP addresses, for example, are numerical but also hierarchical, where the sequence of numbers captures the hierarchy of the network itself.
System-Generated Strings — In principle, the operating system could name resources with strings, but this approach is less efficient and harder to check for validity.
User-Provided Strings — In some cases, the operating system might allow user programs to name resources themselves. The most obvious example is file names, but POSIX also provides named semaphores and named message queues.

Scope and Lifetime

Two other important aspects of a resource identifier are its scope and lifetime.

Scope is where the identifier is valid. For example, file descriptors are usually only valid within specific processes; process IDs are only valid within a single system; but an IP address is valid across the entire Internet.

Lifetime is how long the identifier is valid. For example, a file descriptor is only valid as long as the file is open, while a process ID is valid as long as the process is running.

Reference Counting

Sometimes we can have multiple identifiers that refer to the same resource. For example, POSIX provides the dup system call, which duplicates a file descriptor. In this case, we need to ensure that the resource is only freed when all the identifiers have been closed. We can handle this case by using reference counting, where we keep track of how many identifiers refer to a particular resource, and only free that resource when the count drops to zero.

Aren't we going to have to implement file descriptors in OS/161?
We are. And that's why we'll dive a little deeper into how file descriptors work in the next section.

(When logged in, completion status appears here.)

Go to Parent Page Next Page

CS 134

Warning: Missing Prerequisites