CS 105

Lecture 18: Pipes, Filters, and Writing Your Own

The Unix Filter Idea

A filter reads from standard input, does something to the data, and writes to standard output. Many Unix commands work this way—examples include sort, grep, uniq, tr, cut, wc, cat -n, sed, and many more.

Pipes (|) connect filters together: the output of one program becomes the input for the next. Each program does one small thing, but the whole pipeline does something powerful.

Playing with Pipes

Here are some things to try with our oliver.txt demo file:

cat oliver.txt
cat -n oliver.txt
wc oliver.txt
grep -i oliver oliver.txt
grep -v SPOILER oliver.txt
grep -v SPOILER oliver.txt | wc
sed '/^SPOILER/d' oliver.txt

And some fun with /usr/share/dict/words:

grep "uu" /usr/share/dict/words
grep "^q" /usr/share/dict/words | grep -v "^qu"
grep "^...$" /usr/share/dict/words | head

What letter do most words start with?

cut -c1 < /usr/share/dict/words | sort | uniq -c | sort -rn | head

Try building that pipeline up one step at a time to see what each stage does.

Writing a Filter: shout

Our first filter converts its input to uppercase.

Version 1: System Calls, One Byte at a Time

#include <unistd.h>
#include <ctype.h>

int main() {
    char c;
    while (1) {
        int n = read(0, &c, 1);
        if (n <= 0)
            break;
        c = toupper((unsigned char)c);  // toupper needs an unsigned char value
        write(1, &c, 1);
    }
    return 0;
}

This approach works, but it makes two system calls per character—one read and one write. System calls are expensive, which means this filter runs painfully slowly on large files.

Version 2: Buffered System Calls

Read a whole chunk at a time, transform it, write the chunk:

#include <unistd.h>
#include <ctype.h>

#define BUFSIZE 4096

int main() {
    char buf[BUFSIZE];
    while (1) {
        int n = read(0, buf, BUFSIZE);
        if (n <= 0)
            break;
        for (int i = 0; i < n; i++)
            buf[i] = toupper((unsigned char)buf[i]);
        write(1, buf, n);
    }
    return 0;
}

This version is much faster! But we're still being sloppy—write might not write everything we asked it to, and a signal could interrupt either call.

Version 3: Robust Buffered I/O

Here's what it looks like when you handle short writes and EINTR properly:

#include <unistd.h>
#include <ctype.h>
#include <errno.h>

#define BUFSIZE 4096

int main() {
    char buf[BUFSIZE];
    while (1) {
        ssize_t n = read(0, buf, BUFSIZE);
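        if (n < 0) {
            if (errno == EINTR)
                continue;   // interrupted before reading anything: retry
            return 1;       // real read error: report failure
        }
        if (n == 0)
            break;          // end of input

        for (ssize_t i = 0; i < n; i++)
            buf[i] = toupper((unsigned char)buf[i]);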

        ssize_t written = 0;
        while (written < n) {
            ssize_t w = write(1, buf + written, n - written);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return 1;
            }
            written += w;
        }
    }
    return 0;
}

Notice the write loop: buf + written advances through the buffer as bytes are successfully written. And both read and write retry on EINTR rather than giving up.

But that's a lot more code to write and debug.

Version 4: Let the C Library Do Its Job

All that buffering and retry logic is pretty complicated, but the C library comes to the rescue! stdio can do it all for you, and lets you concentrate on the code that's unique:

#include <stdio.h>
#include <ctype.h>

int main() {
    int c;
    while ((c = getchar()) != EOF)
        putchar(toupper(c));
    return 0;
}

getchar and putchar use a buffer behind the scenes, so most calls never reach the kernel. This version runs at roughly the same speed as version 2, and is much easier to read. For filters, this is the right level of abstraction.

Writing words

The words filter splits input into one word per line—only alphabetic characters count as word characters. It's a tiny state machine: you're either in a word or you're not.

#include <stdio.h>
#include <ctype.h>
#include <stdbool.h>

int main() {
    int c;
    bool in_word = false;
    while ((c = getchar()) != EOF) {
        if (isalpha(c)) {
            putchar(c);
            in_word = true;
        } else if (in_word) {
            putchar('\n');
            in_word = false;
        }
    }
    return 0;
}

Now we can count word frequencies:

./words < oliver.txt | sort | uniq -c | sort -rn | head

The -c flag for uniq tells it to count the number of occurrences of each word.

For the second sort, -n tells it to sort its input numerically, and -r tells it to sort in reverse order.

We can also build a quick-and-dirty spellchecker:

./words < oliver.txt | sort -uf | grep -vif /usr/share/dict/words

Here, the -u flag to sort drops duplicates (like uniq) and the -f flag (fold) makes it ignore case differences.

For grep, the -v flag inverts the match (i.e., it reports lines that don't match any of the given patterns); the -i flag tells grep to ignore case when matching; and the -f flag tells it to read its patterns, one per line, from the file named by the following argument.

How Arguments Work

Before we improve words, let's take a quick look at argc and argv:

#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("argc = %d\n", argc);
    for (int i = 0; i < argc; i++)
        printf("argv[%d] = \"%s\"\n", i, argv[i]);
    return 0;
}

Running ./printargs -c "'-" prints

argc = 3
argv[0] = "./printargs"
argv[1] = "-c"
argv[2] = "'-"

Notice that the shell strips the quotes—the program sees '- not "'-".

The shell also handles:

I/O Redirection: ./words < oliver.txt makes oliver.txt the standard input for words.
Globbing: *.txt expands to the names of all files in the current directory ending in .txt.
Variable Expansion: "$HOME" expands to the value of the HOME environment variable.
Command Substitution: $(date) expands to the output of the date command.

Adding the -c Flag to words

Our spellchecker has a problem: “don't” becomes “don” and “t”, and “t” shows up as misspelled. We need a way to say “these characters also count as word characters”.

#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main(int argc, char *argv[]) {
    const char *extra = "";
    if (argc == 3 && strcmp(argv[1], "-c") == 0)
        extra = argv[2];

    int c;
    int in_word = 0;
    while ((c = getchar()) != EOF) {
        // strchr also matches the terminating '\0', so exclude NUL bytes
        if (isalpha(c) || (c != '\0' && strchr(extra, c))) {
            putchar(c);
            in_word = 1;
        } else if (in_word) {
            putchar('\n');
            in_word = 0;
        }
    }
    return 0;
}

The trick is strchr(extra, c), which searches the extra-characters string for c. When extra is "" (the default), strchr returns NULL for every character except the NUL byte, so nothing changes for ordinary text input.

Now the spellchecker can handle contractions and hyphenated words, so running

./words -c "'-" < oliver.txt | sort -uf | grep -vwif /usr/share/dict/words

produces

adapted
asks
conditions
escapes
Fagin
falls
films
grows
has
including
Laws
litterature
London
named
novels
pickpockets
published
remanes
says
tells
terrable
workhouses

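Note that the dictionary (/usr/share/dict/words) leaves out most plurals and other inflected forms (such as asks, named, and including), so the spellchecker reports some false positives alongside the real misspellings (litterature, remanes, terrable).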

Bonus: Reading Lines with getline

Often we want to read lines, but don't know ahead of time how long they will be. getline handles this task for you. It is a POSIX function rather than part of ISO C, so it is available in the C library on POSIX systems such as Linux and macOS. It takes care of allocating and resizing a buffer as needed, and it returns the length of the line read (or -1 at end of file or on error). You can read more about how it works by running man 3 getline, but the code below shows the basics of how to use it.

The program rev reads lines from standard input and reverses the characters in each line. It uses getline to read lines of arbitrary length. The coding pattern used here is common when using getline:

/* Reads stdin and outputs each line reversed, demos getline() */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    char *line = NULL;    // will be allocated by getline()
    size_t capacity = 0;  // capacity of the line buffer
    ssize_t length;       // length of the line read, or -1 on EOF/error

    while ((length = getline(&line, &capacity, stdin)) != -1) {
        // Reduce length by 1 if the line ends with a newline character
        // because we don't want to move the newline when reversing the line.
        if (length > 0 && line[length - 1] == '\n') {
            --length;
        }

        // Reverse the line in place (excluding the newline character)
        for (ssize_t i = 0; i < length / 2; ++i) {  // same signed type as length
            char temp = line[i];
            line[i] = line[length - 1 - i];
            line[length - 1 - i] = temp;
        }

        // Print the reversed line (including the newline character if present)
        fputs(line, stdout);
    }

    free(line); // free the buffer allocated by getline()
    return 0;
}

Bonus: Argument Parsing with getopt

The getopt function is a convenient way to parse command-line arguments. It handles flags and their arguments, and it prints an error message for you when it sees an unrecognized option or a missing option argument. You can read more about how it works by running man 3 getopt, but the code below shows the basics of how to use it.

count is a simple counter program that takes options to specify how to count. It uses getopt to parse the command-line arguments:

/* Example of option processing using getopt
 *
 * Usage:
 *  ./count [-r] [-s START] [-i INCREMENT] COUNT
 *
 * This program outputs COUNT numbers, starting at START (default 0)
 * and incrementing by INCREMENT (default 1). The -r flag causes the numbers
 * to be output in reverse order.
 *
 * For example, `./count -s 10 -i 2 5` would output:
 * 10
 * 12
 * 14
 * 16
 * 18
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void usage() {
    fprintf(stderr, "Usage: count [-r] [-s START] [-i INCREMENT] COUNT\n");
    exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]) {
    int reverse = 0; // whether to output in reverse order
    int start = 0;   // starting number (default 0)
    int increment = 1; // increment (default 1)

    // Process options using getopt
    int opt;
    while ((opt = getopt(argc, argv, "rs:i:")) != -1) {
        switch (opt) {
            case 'r':
                reverse = 1;
                break;
            case 's':
                start = atoi(optarg);
                break;
            case 'i':
                increment = atoi(optarg);
                break;
            default:
                // getopt returns '?' for an unknown option or a missing
                // option argument, and has already printed a diagnostic
                usage();  // terminate with usage message
        }
    }

    // After processing options, optind is the index of the first non-option
    // argument
    if (optind >= argc) {
        fprintf(stderr, "Expected COUNT argument after options\n");
        usage();
    }

    int count = atoi(argv[optind]); // the COUNT argument

    if (reverse) {
        for (int i = count - 1; i >= 0; i--) {
            printf("%d\n", start + i * increment);
        }
    } else {
        for (int i = 0; i < count; i++) {
            printf("%d\n", start + i * increment);
        }
    }

    return EXIT_SUCCESS;
}

The third argument to getopt is a string that specifies the valid options. In this program, we specified "rs:i:", which means that it takes three options: -r, -s and -i. Both s and i are followed by a colon, which means they require an argument (e.g., -s 10 or -i 5). The r option does not require an argument (i.e., it's just -r).

You'll also notice that the code uses some global variables that getopt provides: optind, which is the index of the next argument to be processed, and optarg, which is the argument for the current option (if it requires one). After the loop, optind will point to the first non-option argument, if there are any. See the man page for more details.

Bonus: String Searching with strstr

The C library provides a number of convenient functions for working with NUL-terminated strings (type man 3 string to see them all). One of these is strstr, which searches for a substring within a string. It returns a pointer to the first occurrence of the substring, or NULL if the substring is not found. You can read more about how it works by running man 3 strstr, but the code below shows the basics of how to use it.

The program contains checks whether its first argument contains its second argument as a substring:

/* Takes two string arguments, and says whether the first contains the second.
 * Demos strstr()
 *
 * Example usage:
 *  ./contains "hello world" "world"   # outputs: yes
 *  ./contains "hello world" "foo"     # outputs: no
 */

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s STRING SUBSTRING\n", argv[0]);
        return 1;
    }

    const char *string = argv[1];
    const char *substring = argv[2];

    if (strstr(string, substring) != NULL) {
        printf("yes\n");
    } else {
        printf("no\n");
    }

    return 0;
}
