Lecture 18: Pipes, Filters, and Writing Your Own
The Unix Filter Idea
A filter reads from standard input, does something to the data,
and writes to standard output. Many Unix commands work this
way—examples include sort, grep, uniq, tr, cut, wc, cat
-n, sed, and many more.
Pipes (|) connect filters together: the output of one program
becomes the input for the next. Each program does one small thing,
but the whole pipeline does something powerful.
Playing with Pipes
Here are some things to try with our oliver.txt demo file:
cat oliver.txt
cat -n oliver.txt
wc oliver.txt
grep -i oliver oliver.txt
grep -v SPOILER oliver.txt
grep -v SPOILER oliver.txt | wc
sed '/^SPOILER/d' oliver.txt
And some fun with /usr/share/dict/words:
grep "uu" /usr/share/dict/words
grep "^q" /usr/share/dict/words | grep -v "^qu"
grep "^...$" /usr/share/dict/words | head
What letter do most words start with?
cut -c1 < /usr/share/dict/words | sort | uniq -c | sort -rn | head
Try building that pipeline up one step at a time to see what each stage does.
Writing a Filter: shout
Our first filter converts its input to uppercase.
Version 1: System Calls, One Byte at a Time
#include <unistd.h>
#include <ctype.h>
int main() {
char c;
while (1) {
int n = read(0, &c, 1);
if (n <= 0)
break;
c = toupper((unsigned char)c); // toupper needs a non-negative value
write(1, &c, 1);
}
return 0;
}
This approach works, but it makes two system calls per
character—one read and one write. System calls are expensive,
which means this filter runs painfully slowly on large files.
Version 2: Buffered System Calls
Read a whole chunk at a time, transform it, write the chunk:
#include <unistd.h>
#include <ctype.h>
#define BUFSIZE 4096
int main() {
char buf[BUFSIZE];
while (1) {
int n = read(0, buf, BUFSIZE);
if (n <= 0)
break;
for (int i = 0; i < n; i++)
buf[i] = toupper((unsigned char)buf[i]); // toupper needs a non-negative value
write(1, buf, n);
}
return 0;
}
This version is much faster! But we're still being sloppy—write
might not write everything we asked it to, and a signal could
interrupt either call.
Version 3: Robust Buffered I/O
Here's what it looks like when you handle short writes and EINTR properly:
#include <unistd.h>
#include <ctype.h>
#include <errno.h>
#define BUFSIZE 4096
int main() {
char buf[BUFSIZE];
while (1) {
ssize_t n = read(0, buf, BUFSIZE);
if (n < 0) {
if (errno == EINTR)
continue;
return 1; // read failed for a reason other than EINTR
}
if (n == 0)
break;
for (ssize_t i = 0; i < n; i++)
buf[i] = toupper((unsigned char)buf[i]); // toupper needs a non-negative value
ssize_t written = 0;
while (written < n) {
ssize_t w = write(1, buf + written, n - written);
if (w < 0) {
if (errno == EINTR)
continue;
return 1;
}
written += w;
}
}
return 0;
}
Notice the write loop: buf + written advances through the buffer as bytes
are successfully written. And both read and write retry on EINTR
rather than giving up.
But that's a lot more code to write and debug.
Version 4: Let the C Library Do Its Job
All that buffering and retry logic is pretty complicated, but the C
library comes to the rescue! stdio can do it all for you, and
lets you concentrate on the code that's unique:
#include <stdio.h>
#include <ctype.h>
int main() {
int c;
while ((c = getchar()) != EOF)
putchar(toupper(c));
return 0;
}
getchar and putchar use a buffer behind the scenes, so most calls
never reach the kernel. This version is about as fast as Version 2,
and is much easier to read.
For filters, this is the right level of abstraction.
Writing words
The words filter splits input into one word per line—only alphabetic
characters count as word characters. It's a tiny state machine: you're
either in a word or you're not.
#include <stdio.h>
#include <ctype.h>
#include <stdbool.h>
int main() {
int c;
bool in_word = false;
while ((c = getchar()) != EOF) {
if (isalpha(c)) {
putchar(c);
in_word = true;
} else if (in_word) {
putchar('\n');
in_word = false;
}
}
if (in_word)
putchar('\n'); // end the last word if the input stops mid-word
return 0;
}
Now we can do word frequencies,
./words < oliver.txt | sort | uniq -c | sort -rn | head
The -c flag for uniq tells it to count the number of
occurrences of each word.
For the second sort, -n tells it to sort its input
numerically, and -r tells it to sort in reverse order.
or a quick-and-dirty spellchecker,
./words < oliver.txt | sort -uf | grep -vif /usr/share/dict/words
Here, the -u flag to sort drops duplicates (like uniq) and the
-f flag (fold) makes it ignore case differences.
For grep, the -v flag inverts the match (i.e., it reports lines
that don't match any of the given patterns); the -i flag tells
grep to ignore case when matching; and the -f flag tells it to
read its patterns, one per line, from the file named in its
argument.
How Arguments Work
Before we improve words, let's take a quick look at argc and argv:
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("argc = %d\n", argc);
for (int i = 0; i < argc; i++)
printf("argv[%d] = \"%s\"\n", i, argv[i]);
return 0;
}
Running ./printargs -c "'-" prints
argc = 3
argv[0] = "./printargs"
argv[1] = "-c"
argv[2] = "'-"
Notice that the shell strips the quotes—the program sees '- not "'-".
The shell also handles
- I/O Redirection: ./words < oliver.txt makes oliver.txt the standard input for words.
- Globbing: *.txt expands to all files ending in .txt.
- Variable Expansion: "$HOME" expands to the value of the HOME environment variable.
- Command Substitution: $(date) expands to the output of the date command.
Adding the -c Flag to words
Our spellchecker has a problem: “don't” becomes “don” and “t”, and “t” shows up as misspelled. We need a way to say “these characters also count as word characters”.
#include <stdio.h>
#include <ctype.h>
#include <string.h>
int main(int argc, char *argv[]) {
const char *extra = "";
if (argc == 3 && strcmp(argv[1], "-c") == 0)
extra = argv[2];
int c;
int in_word = 0;
while ((c = getchar()) != EOF) {
if (isalpha(c) || (c != '\0' && strchr(extra, c))) { // strchr would match a NUL byte
putchar(c);
in_word = 1;
} else if (in_word) {
putchar('\n');
in_word = 0;
}
}
if (in_word)
putchar('\n'); // end the last word if the input stops mid-word
return 0;
}
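The trick is strchr(extra, c): it searches the extra-characters string
for c. When extra is "" (the default), strchr returns NULL for
every ordinary character, so nothing changes. One subtlety: strchr treats
the terminating NUL as part of the string, so a NUL byte in the input would
match unless it is ruled out separately.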
Now the spellchecker can handle contractions and hyphenated words, so running
./words -c "'-" < oliver.txt | sort -uf | grep -vwif /usr/share/dict/words
produces
adapted
asks
conditions
escapes
Fagin
falls
films
grows
has
including
Laws
litterature
London
named
novels
pickpockets
published
remanes
says
tells
terrable
workhouses
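Note that the dictionary (/usr/share/dict/words) doesn't include plurals or other inflected forms (such as "adapted" or "including"), so the output contains false positives alongside the real misspellings (litterature, remanes, terrable).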
Bonus: Reading Lines with getline
Often we want to read lines, but don't know ahead of time how long they will be. getline is a convenient function that is part of the standard C library (on a POSIX system) that handles this task for you. It takes care of allocating and resizing a buffer as needed, and it returns the length of the line read. You can read more about how it works by running man 3 getline, but the code below shows the basics of how to use it.
The program rev reads lines from standard input and reverses the characters in each line. It uses getline to read lines of arbitrary length. The coding pattern used here is common when using getline:
/* Reads stdin and outputs each line reversed, demos getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
char *line = NULL; // will be allocated by getline()
size_t capacity = 0; // capacity of the line buffer
ssize_t length; // length of the line read, or -1 on EOF/error
while ((length = getline(&line, &capacity, stdin)) != -1) {
// Reduce length by 1 if the line ends with a newline character
// because we don't want to move the newline when reversing the line.
if (length > 0 && line[length - 1] == '\n') {
--length;
}
// Reverse the line in place (excluding the newline character)
for (ssize_t i = 0; i < length / 2; ++i) { // ssize_t matches the type of length
char temp = line[i];
line[i] = line[length - 1 - i];
line[length - 1 - i] = temp;
}
// Print the reversed line (including the newline character if present)
fputs(line, stdout);
}
free(line); // free the buffer allocated by getline()
return 0;
}
Bonus: Argument Parsing with getopt
The getopt function is a convenient way to parse command-line arguments. It handles flags and their arguments, and it prints an error message when it sees an unknown option or a missing option argument (writing the usage message itself is up to you). You can read more about how it works by running man 3 getopt, but the code below shows the basics of how to use it.
count is a simple counter program that takes options to specify how to count. It uses getopt to parse the command-line arguments:
/* Example of option processing using getopt
*
* Usage:
* ./count [-r] [-s START] [-i INCREMENT] COUNT
*
* This program outputs COUNT numbers, starting at START (default 0)
* and incrementing by INCREMENT (default 1). The -r flag causes the numbers
* to be output in reverse order.
*
* For example, `./count -s 10 -i 2 5` would output:
* 10
* 12
* 14
* 16
* 18
*/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
void usage() {
fprintf(stderr, "Usage: count [-r] [-s START] [-i INCREMENT] COUNT\n");
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
int reverse = 0; // whether to output in reverse order
int start = 0; // starting number (default 0)
int increment = 1; // increment (default 1)
// Process options using getopt
int opt;
while ((opt = getopt(argc, argv, "rs:i:")) != -1) {
switch (opt) {
case 'r':
reverse = 1;
break;
case 's':
start = atoi(optarg);
break;
case 'i':
increment = atoi(optarg);
break;
default: // '?': unknown option or missing argument; getopt has already printed a diagnostic
fprintf(stderr, "Invalid option or missing argument: -%c\n", optopt);
usage(); // terminate with usage message
}
}
// After processing options, optind is the index of the first non-option
// argument
if (optind >= argc) {
fprintf(stderr, "Expected COUNT argument after options\n");
usage();
}
int count = atoi(argv[optind]); // the COUNT argument
if (reverse) {
for (int i = count - 1; i >= 0; i--) {
printf("%d\n", start + i * increment);
}
} else {
for (int i = 0; i < count; i++) {
printf("%d\n", start + i * increment);
}
}
return EXIT_SUCCESS;
}
The third argument to getopt is a string that specifies the valid options. In this program, we specified "rs:i:", which means that it takes three options: -r, -s and -i. Both s and i are followed by a colon, which means they require an argument (e.g., -s 10 or -i 5). The r option does not require an argument (i.e., it's just -r).
You'll also notice that the code uses some global variables that getopt provides: optind, which is the index of the next argument to be processed; optarg, which is the argument for the current option (if it requires one); and optopt, which holds the option character that caused an error. After the loop, optind is the index of the first non-option argument, if there is one. See the man page for more details.
Bonus: String Searching with strstr
The C library provides a number of convenient functions for working with NUL-terminated strings (type man 3 string to see them all). One of these is strstr, which searches for a substring within a string. It returns a pointer to the first occurrence of the substring, or NULL if the substring is not found. You can read more about how it works by running man 3 strstr, but the code below shows the basics of how to use it.
The program contains checks whether its first argument contains its second argument as a substring:
/* Takes two string arguments and says whether the first contains the second.
* Demos strstr()
*
* Example usage:
* ./contains "hello world" "world" # outputs: yes
* ./contains "hello world" "foo" # outputs: no
*/
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "Usage: %s STRING SUBSTRING\n", argv[0]);
return 1;
}
const char *string = argv[1];
const char *substring = argv[2];
if (strstr(string, substring) != NULL) {
printf("yes\n");
} else {
printf("no\n");
}
return 0;
}