Saturday, November 30, 2013

To err is human; to out, pipeline.

Many introductory programming classes begin with a hello world example like this:
#include <stdio.h>

int main() {
  printf("hello world!\n");
  return 0;
}
From this point on, the notion of writing human-readable messages from a program to “standard output” is cemented in the student's mind, and the student keeps doing so for the rest of their professional career. This is bad because it is not how the system was designed to be used. The intent is that “standard output” (abbreviated as stdout in a program) be machine readable, while “standard error” (abbreviated as stderr) be human readable. A corrected version of hello world would be:
#include <stdio.h>

int main() {
  fprintf(stderr, "hello world!\n");
  return 0;
}
A pipeline is a Unix feature that lets you chain the standard output of one program to the standard input of another, so that simple programs can be composed to perform complex functions. It is also a feature of many operating systems inspired by Unix, such as MS-DOS and its successor, Windows. Standard error is meant for error reports and other messages for the person running the program, so you typically don't send it down the pipeline.

As an example, suppose a text file contains your personal address book, one contact per line with space-separated fields, like this:
$ cat contacts 
John Smith johns@gmail.com 555-1234
John Doe johnd@hotmail.com 555-2345
Jane Smith janes@yahoo.com 555-1234
Jane Doe janed@aol.com 555-2345
Alan Smithee alans@inbox.com 555-3456
Smith Johnson smithj@mail.com 555-4567
Then you could list all last names like this:
$ cut -d' ' -f 2 contacts 
Smith
Doe
Smith
Doe
Smithee
Johnson
Enumerate the Smiths (last name):
$ cut -d' ' -f 2 contacts | grep '^Smith$'
Smith
Smith
And count the Smiths:
$ cut -d' ' -f 2 contacts | grep '^Smith$' | wc
       2       2      12
You can sort and list the unique last names:
$ cut -d' ' -f 2 contacts | sort -u
Doe
Johnson
Smith
Smithee
You can run similar queries on the email and phone number fields, all using simple programs that don't do very much on their own but can be combined in a pipeline to perform specific queries or text manipulations:
  • cut: extracts the fields of input lines.
  • grep: prints input lines matching a pattern.
  • wc: counts lines, words, and bytes.
  • sort: sorts and optionally prints only unique lines.
This doesn't work if any program in the pipeline writes its error messages to standard output, because the next program would read those messages as if they were data.

For a beginner who might confuse the roles of standard output and standard error, the following mnemonic should help.
To err is human; to out, pipeline.
Maybe humans do err too much.
