Friday, October 10, 2008

Using Perl like awk and sed

It looks like the designer of Perl really wanted to make it a viable awk and sed alternative. It is possible to run perl with command line flags that make it behave much like awk and sed.

The simplest use, like egrep (which outputs every line that matches a regular expression), is shown below. The commands listed are all equivalent.
cat | egrep 'pattern'
cat | awk '/pattern/ { print }'
cat | sed -n '/pattern/ p'
cat | perl -ne 'print if /pattern/'
Here we look at the Perl case more closely. The statement print if /pattern/ is valid Perl code. It works because the language is designed so that:
  1. The syntax "statement if expression" is the same as 'if (expression) { statement; }'.
  2. The expression for pattern matching, usually written as '$value =~ /pattern/', can be abbreviated as simply '/pattern/' or 'm{pattern}'. The default value is drawn from $_ (a built-in variable).
  3. If the argument to print is missing, it prints the value of $_.
Alternatively, we can write instead:
cat | perl -ne '/pattern/ and print'
which is the same thing, relying on the fact that the 'and' operator short-circuits.

The command line flags -ne accomplish the following:
  • -e is used to specify the expression to evaluate.
  • -n wraps the expression inside a while loop that places each input line into $_ and evaluates the expression.
Alternatively, there is also a -p flag which replaces -n, and it allows Perl to simulate sed:
  • -p wraps the expression inside a while loop that places each input line into $_, evaluates the expression (which typically modifies $_), and then prints the resulting $_.
Here is an example (note that awk, sed and Perl have slightly different regular expression syntax and flags):
cat | sed 's/pattern/replacement/flags'
cat | perl -pe 's/pattern/replacement/flags'
Again, this works because regular expression substitution in perl, normally written as '$value =~ s/pattern/replacement/flags' or '$value =~ s{pattern}{replacement}flags', operates on $_ by default.

Here are a few flags that make Perl more awk-like, with field separators.
  • -l makes each print statement output a record separator that is the same as the input record separator (newline by default). When combined with -n or -p, it also strips that separator from each input line.
  • -Fpattern is used to specify input field separator, much like awk's -F option.
  • -a turns on the autosplit mode, so input fields are placed into @F array.
A good mnemonic is perl -Fpattern -lane 'expression'. Example:
cat /etc/passwd | awk -F: '{ print $1 }'
cat /etc/passwd | perl -F: -lane 'print $F[0]'
Note that Perl fields are $F[0], $F[1], ... (elements of the @F array, indexed from zero); awk fields are $1, $2, ... instead. However, awk $0 (the whole input line) corresponds to $_ in Perl.

If we want to combine regular expression matching and field separation, we might have something like:
find . | awk -F/ '/hw[0-9]+/ { print $1 }'
find . | perl -F/ -lane 'print $F[0] if /hw[0-9]+/'
Many awk variables have Perl equivalents as well. To use them under their awk-like names, load the English module by passing the -MEnglish flag to Perl like this:
cat | awk '{ print NR, $0 }'
cat | perl -MEnglish -ne 'print $NR, " ", $_'
Note that the commas in the Perl print statement do not normally print an output field separator. To get behavior more like awk's, do this:
cat | awk 'BEGIN { OFS = ": " } { print NR, $0 }'
cat | perl -MEnglish -ne 'BEGIN { $OFS = ": " } print $NR, $_'
In conclusion, Perl goes to considerable lengths to behave like awk and sed. Both sed and awk come with fairly comprehensive programming constructs of their own, but it is nice how Perl serves as a grand unified text processing and reporting tool.

3 comments:

Nik said...

Very useful post!

Thanks!!

ack said...

Perl can be made even more Awk-like using this:

$ echo foo bar baz | awk '/foo/ { print $2; }'

$ echo foo bar baz | perl -lane '/foo/ and do { print $F[2]; }'

Isn't Perl great? :-)

ack said...

ha, copy&paste gotcha, the Awk example would need to use $3 of course :-)