CS 134

C-Style Strings: A Refresher

In CS 70 (and perhaps CS 105), you learned about C-style strings, but it's always good to have a refresher. In this guide, we'll revisit the basics of C-style strings, their memory representation, common operations, and best practices. Understanding C-style strings is essential for systems programming, including OS development.

How Does C Handle Strings?

In C, a string is just an area of memory that holds a sequence of characters. Here are two declarations of C strings:

char greeting[] = "Hello, world!";
const char* magic_word = "Xyzzy";

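Both declarations give us a sequence of characters in memory, but they are not interchangeable: greeting is an array holding its own writable copy of the characters, whereas magic_word points at the string literal itself, which must not be modified (hence the const).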

To know where the string ends, we use a special character called the NUL terminator (written as '\0', and just a zero byte in memory). So in memory, our greeting string looks like

 H    e    l    l    o    ,   ' '   w    o    r    l    d    !   \0
 0    1    2    3    4    5    6    7    8    9   10   11   12   13
  • Duck speaking

    Did you mean to say “NULL” termination?

  • PinkRobot speaking

    No. In code, we use NULL to refer to the null pointer (an address). Here we're not talking about pointers, but about the ASCII character with value 0, which has the historical name “NUL” (see the Wikipedia article on control characters). So it's not a typo.

  • Hedgehog speaking

    So should we actually write "Hello, world!\0" to be explicit about the NUL terminator?

  • PinkRobot speaking

    No, that happens automatically when you use a string literal. But if you initialize an array one character at a time, you need to include the NUL terminator yourself.

    For example, you could define greeting as

    char greeting[] = {'H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!', '\0'};
    
  • Goat speaking

    Meh. No thanks. I'll stick with the string literal.

  • Duck speaking

    But NULL—sorry—NUL termination is so-o-o inefficient. To find the length of the string, we have to scan the entire string until we find the NUL terminator. Why not store the length of the string along with the string itself?

  • Cat speaking

    NUL termination also means we can't store NUL characters in the string itself. That's a limitation.

  • PinkRobot speaking

    Good points! And in C++, the std::string class does store the length to avoid those issues.

  • BlueRobot speaking

    But in practice, strings are usually short, so the overhead of scanning for the NUL terminator is negligible. And the simplicity of NUL-terminated strings makes them very efficient for many purposes.
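
To make Duck's point concrete, here is a minimal sketch of how a length function has to work on a NUL-terminated string. The name my_strlen is our own stand-in for illustration; the real strlen in a C library is usually more heavily optimized, but it still has to look at every character.

```c
#include <stddef.h>

/* my_strlen(s): count the characters before the NUL terminator
 *   (hypothetical stand-in for strlen, for illustration only)
 */
size_t
my_strlen(const char *s)
{
        const char *p = s;

        while (*p != '\0') {            // Visit each character until the NUL
                p++;
        }
        return (size_t)(p - s);         // Distance walked is the length
}
```

Calling my_strlen("Hello, world!") walks 13 characters and returns 13. Cat's limitation shows up too: for the literal "ab\0cd" the result is 2, because the scan stops at the first NUL.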

Although there are some downsides to C's choice, it's arguably simpler to just have an array of characters rather than an array of characters and a length field. And this approach does allow some tricks. For example, we can turn greeting into two separate strings:

greeting[5] = '\0';         // Overwrite the comma with a NUL terminator
char* hello = greeting;     // Points to "Hello"
char* world = greeting + 7; // Points to "world!"

In a memory-constrained environment, this trick can be useful for saving space by not copying the string, but instead just transforming it in place.
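
The same in-place trick can be packaged as a small helper. This is a sketch rather than a library function: split_at is a hypothetical name, and it assumes the string lives in a writable array (not a string literal).

```c
#include <stddef.h>
#include <string.h>

/* split_at(s, delim): split s in place at the first delim
 *   (hypothetical helper, for illustration only)
 *
 * - overwrites the first delim with a NUL, so s now ends there
 * - returns a pointer to the text after delim
 * - returns NULL (and leaves s untouched) if delim is absent or is '\0'
 */
char *
split_at(char *s, char delim)
{
        char *p = (delim != '\0') ? strchr(s, delim) : NULL;

        if (p == NULL) {                // Nothing to split on
                return NULL;
        }
        *p = '\0';                      // Terminate the first piece here
        return p + 1;                   // Second piece starts just after
}
```

With char greeting[] = "Hello, world!", calling split_at(greeting, ',') leaves greeting holding "Hello" and returns a pointer to " world!" (leading space included), all without copying a byte.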

Common Operations and string.h

The string.h header provides essential functions for C-style string manipulation:

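Function                  Use
strlen(s)                 Returns the length of string s (excluding the NUL terminator).
strcpy(dest, src)         Copies src, including its NUL terminator, to dest; dest must be large enough.
strcat(dest, src)         Appends src to the end of dest; dest must have room for the result.
strcmp(s1, s2)            Compares s1 and s2; returns a negative, zero, or positive value when s1 sorts before, equal to, or after s2.
strchr(s, c)              Finds the first occurrence of character c in s; returns NULL if there is none.
strstr(haystack, needle)  Finds the first occurrence of needle in haystack; returns NULL if there is none.
strtok(s, delims)         Splits s into tokens separated by any character in delims; modifies s in place.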
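
A couple of the functions in the table are easiest to understand in isolation. The sketch below uses strtok and strcmp; the helper names count_words and same_string are hypothetical, chosen just for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* count_words(line): count the space-separated words in line
 *   (hypothetical helper; strtok writes NULs into line as it goes)
 */
size_t
count_words(char *line)
{
        size_t n = 0;
        char *tok;

        // First call passes the string; later calls pass NULL to continue
        for (tok = strtok(line, " "); tok != NULL; tok = strtok(NULL, " ")) {
                n++;
        }
        return n;
}

/* same_string(a, b): nonzero if a and b hold the same characters
 *   (hypothetical helper)
 */
int
same_string(const char *a, const char *b)
{
        return strcmp(a, b) == 0;       // strcmp returns 0 on a match
}
```

Two things to notice: strcmp reports equality with 0, not with a "true" value, and strtok both modifies its argument and keeps hidden state between calls, so the buffer passed to count_words must be writable and is cut up by the time the function returns.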


Example usage:

#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define LITTLE_BUFFER_SIZE 50

int
main() {
        char str1[LITTLE_BUFFER_SIZE] = "Operating";    // Writeable buffer
        const char *str2 = "Super Systems";             // Fixed literal
        char str3[LITTLE_BUFFER_SIZE];
        char *str4;
        const char *cp;

        printf("Length of str1: %zu\n", strlen(str1));
        cp = strstr(str2, " S");                        // Find some substring

        strcat(str1, cp);
        printf("Concatenated string: %s\n", str1);

        strcpy(str3, str1);
        str3[5] = '\0';  // Shorten the string
        printf("First 5 characters: %s\n", str3);

        str4 = strdup(str3);    // Make a duplicate of str3 on the heap
        if (str4 == NULL) {     // strdup returns NULL if allocation fails
                return 1;
        }
        printf("Duplicated string: %s\n", str4);
        free(str4);             // Free the heap memory

        return 0;
}
  • Rabbit speaking

    This code is actually pretty dangerous. It's using fixed-size buffers without checking the size of the strings. That's a recipe for buffer overflows.

  • PinkRobot speaking

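    It's true. However, there's a bit of disagreement about the right fix. Historically, people used strncpy and strncat to limit the number of characters copied. But those functions are awkward to use: strncpy leaves the destination without a NUL terminator whenever the source is at least as long as the limit, and strncat's limit counts the characters to append rather than the size of the buffer. So they can be error-prone.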

    Various alternatives have been proposed; for example, BSD (and macOS) recommend strlcpy and strlcat, which guarantee NUL termination. But those functions aren't part of the C standard, so they're not available everywhere. So it's a bit of a mess.

  • Duck speaking

    I could write my own.

  • PinkRobot speaking

    Indeed we could. For portable code that does what you actually want, it might even be the best solution. But it's a bit of a pain.
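
One portable option that is sometimes used for a bounded copy is snprintf, which has been part of standard C since C99 and always NUL-terminates when given a nonzero size. The sketch below wraps it in a hypothetical helper, bounded_copy; the name and the return convention are our own assumptions, not a library API.

```c
#include <stddef.h>
#include <stdio.h>

/* bounded_copy(dest, size, src): copy src into dest, truncating if needed
 *   (hypothetical helper built on snprintf, for illustration only)
 *
 * - dest is always NUL-terminated when size > 0
 * - returns 1 if all of src fit, 0 if it was truncated (or on error)
 */
int
bounded_copy(char *dest, size_t size, const char *src)
{
        // snprintf returns the length src needed, even if it didn't fit
        int needed = snprintf(dest, size, "%s", src);

        return needed >= 0 && (size_t)needed < size;
}
```

Copying "Hello, world!" into a 6-byte buffer this way stores "Hello" plus the NUL and returns 0, so the caller can tell that truncation happened. Whether silent truncation is acceptable depends on the program; sometimes failing outright is the safer choice.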

Here's a safer version of strcpy where we specify the end of the buffer:

/* safer_strcpy(dest, dest_end, src): copy a string into a buffer
 *   dest: destination buffer
 *   dest_end: past-the-end (should be dest + buffer size)
 *   src: source string
 *
 * - returns the address of the NUL terminator on success (supports chaining)
 * - returns NULL if the buffer is too small
 *
 * Note: In safe code, you need to *check* the return value!
 */
char *
safer_strcpy(char *dest, char *dest_end, const char *src)
{
        char c;

        do {
                if (dest >= dest_end) {     // Out of space? Bail out
                        return NULL;
                }
                c = *src++;                 // Copy a character
                *dest++ = c;
        } while (c != '\0');                // Done when we copied NUL
        return --dest;                      // Back up to be on the NUL
}
  • Try in OnlineGDB
    • Try changing LITTLE_BUFFER_SIZE to 10 to see what happens when the buffer is too small.

As we can see here, one nice thing about the simplicity of C strings is that if the C library doesn't provide the helper function you need, you can write it yourself pretty easily.
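
To show what checking the return value and chaining look like in practice, here is a sketch that uses safer_strcpy to build one string from two. The helper join_two is hypothetical, and safer_strcpy is repeated from above only so the block compiles on its own.

```c
#include <stddef.h>

/* safer_strcpy, repeated from above so this sketch is self-contained */
char *
safer_strcpy(char *dest, char *dest_end, const char *src)
{
        char c;

        do {
                if (dest >= dest_end) {     // Out of space? Bail out
                        return NULL;
                }
                c = *src++;                 // Copy a character
                *dest++ = c;
        } while (c != '\0');                // Done when we copied NUL
        return --dest;                      // Back up to be on the NUL
}

/* join_two(buf, size, a, b): write a followed by b into buf
 *   (hypothetical helper, for illustration only)
 *
 * - returns 0 on success
 * - returns -1 if buf is too small (buf may then lack a NUL terminator)
 */
int
join_two(char *buf, size_t size, const char *a, const char *b)
{
        char *end = buf + size;             // Past-the-end of the buffer
        char *p;

        p = safer_strcpy(buf, end, a);      // p lands on the NUL after a
        if (p == NULL) {
                return -1;
        }
        p = safer_strcpy(p, end, b);        // Chain: overwrite that NUL with b
        if (p == NULL) {
                return -1;
        }
        return 0;
}
```

With a 50-byte buffer, join_two(buf, 50, "Operating", " Systems") succeeds and leaves "Operating Systems" in buf. With a 10-byte buffer, the first copy fits exactly but the second runs out of room, so join_two returns -1; at that point the buffer's contents should be treated as garbage, since the partial copy has overwritten the NUL.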

Comparison with C++ std::string

While std::string offers convenience and safety, C-style strings are still crucial in systems programming. Here's a quick comparison:

Feature            C-style strings    std::string
Memory management  Manual             Automatic
Bounds checking    None (unsafe)      Possible
NUL termination    Required           Handled internally
Performance        Generally faster   Slight overhead
Compatibility      C and C++          C++ only

Best Practices and Pitfalls

  1. Always ensure strings are NUL-terminated.
  2. Avoid code that could cause a buffer overflow (either avoid fixed-size buffers or be sure to check buffer sizes somehow).
  3. Check the return values of string functions; strchr and strstr, for example, return NULL when nothing is found, and strdup returns NULL when allocation fails.
  4. Be cautious with user input to avoid buffer overflow vulnerabilities.
  5. Remember that string literals are immutable; attempting to modify one is undefined behavior.
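
As one way to put several of these practices together, here is a sketch of reading a line of user input into a fixed-size buffer. The helper read_line is hypothetical; it relies on the standard fgets, which never writes more than the given number of bytes and NUL-terminates whatever it stores.

```c
#include <stdio.h>
#include <string.h>

/* read_line(in, buf, size): read one line of input into buf
 *   (hypothetical helper, for illustration only)
 *
 * - reads at most size - 1 characters, so buf cannot overflow
 * - strips the trailing newline, if one was read
 * - returns buf on success, NULL on end-of-file or error
 */
char *
read_line(FILE *in, char *buf, int size)
{
        if (fgets(buf, size, in) == NULL) {     // Check the return value
                return NULL;
        }
        buf[strcspn(buf, "\n")] = '\0';         // Drop the newline, if any
        return buf;
}
```

If the input line is longer than the buffer, read_line stores only the part that fits, and the rest of the line stays in the stream for the next read. A real program would need to decide whether to discard that remainder or report an error.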
