Parsing the Command Line with Getopt::*

Programs need input, and for many programs, input begins on the command line. For example, a program to print files might be invoked as
pr -l -n -a 10:00 foo bar
pr is the name of the program; it is followed here by six arguments. The arguments are of two sorts: options and file names.

-l, -n, and -a 10:00 are options. They control the manner in which the program executes. In this case, -l tells pr to print in landscape orientation, -n tells it to print page numbers, and -a 10:00 tells it to print after 10:00. Options are sometimes called flags or switches. foo and bar are file names. pr will read these files to obtain the actual text to print.

Parsing the Command Line

In Perl, command line arguments are made available to the program in the global @ARGV array. This happens automatically: you don't have to declare anything or do anything to get them. If you wrote pr in Perl and entered the command line shown above, then at the beginning of program execution @ARGV would have six elements:
$ARGV[0]	'-l'
$ARGV[1]	'-n'
$ARGV[2]	'-a'
$ARGV[3]	'10:00'
$ARGV[4]	'foo'
$ARGV[5]	'bar'
Now the fun starts. Given @ARGV as shown, the program has to identify -l, -n, and -a as options, associate 10:00 with -a, and identify foo and bar as file names. This is called parsing the command line.
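To make the task concrete, here is a minimal sketch of a hand-written parser for the pr example. It is only an illustration: the %options hash, the @files array, and the error messages are my own assumptions, and it handles nothing beyond one option per argument.

```perl
use strict;
use warnings;

# Stand-in for a real command line, so the sketch runs on its own.
local @ARGV = qw(-l -n -a 10:00 foo bar);

my %options;                          # option letter => value
while (@ARGV and $ARGV[0] =~ /^-(\w)$/) {
    my $letter = $1;
    shift @ARGV;                      # consume the option itself
    if ($letter eq 'a') {             # -a is the only option with an argument
        die "pr: -a needs a time\n" unless @ARGV;
        $options{a} = shift @ARGV;    # associate the next element with -a
    }
    elsif ($letter eq 'l' or $letter eq 'n') {
        $options{$letter} = 1;        # plain switch
    }
    else {
        die "pr: unknown option -$letter\n";
    }
}
my @files = @ARGV;                    # whatever is left is file names

print "options: ", join(', ', map { "$_=$options{$_}" } sort keys %options), "\n";
print "files:   @files\n";
```

The loop stops at the first element that doesn't look like a single-letter option, so foo and bar are left for the program to read.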

Easy to be Hard

Parsing the command line is a problem. The problem isn't that it is so hard, but rather that it is so easy: for many programs, it can be done in under 20 lines of code. Because parsing the command line seems easy, it is often not identified as a distinct function of the program. It never gets a functional specification, or a design, or even the considered attention of the programmer. This leads to many bad things:

Design by accretion

As the program evolves, parsing features are added on an ad-hoc basis.

Open code

The command line is parsed in open code, scattered across the program. The parsing code isn't contained in any subroutine, module, or class.

Non-standard interfaces

Different programs are liable to parse the command line in slightly different ways. This confuses users.

Sub-standard interfaces

Programmers tend to implement only what they need—or think they need. Features such as switch clustering, abbreviations, and help text, for example, may be omitted.

Bugs

When every program has its own parsing code, every program can have its own parsing bugs.

Duplication

You keep writing those same 20 lines of code, over and over again, in every program.

The Eightfold Path

In Perl, there is a better way. In fact, there are many better ways. In 00modlist.long.html, we find
Getopt::Declare      An easy-to-use WYSIWYG command-line parser
Getopt::EvaP         Long/short options, multilevel help          
Getopt::Long         Advanced option handling                     
Getopt::Mixed        Supports both long and short options         
Getopt::Regex        Option handling using regular expressions    
Getopt::Simple       A simplified interface to Getopt::Long       
Getopt::Std          Implements basic getopt and getopts          
Getopt::Tabular      Table-driven argument parsing with help text 
Each of these is a Perl module for parsing the command line. They have been designed, written, debugged, and encapsulated. You don't have to write them again. They support standard interfaces.

Getopt::Std

If the list above seems daunting, start with Getopt::Std. Getopt::Std supports a good, simple command line style that is adequate for many programs. It automatically handles options given in any of these forms:
pr -l -n -a 10:30 foo bar
pr -lna 10:30 foo bar
pr -lna10:30 foo bar
To use Getopt::Std, write
use Getopt::Std;
my %Options;
getopt('a', \%Options);
Getopt::Std exports the getopt() routine. getopt() takes two parameters: a string and a hash reference. The string lists all the options that take arguments. The hash receives the options found on the command line.

getopt() removes the options from @ARGV and parses them. Upon return, each option appears as a hash key in %Options. For each key, the hash value is the argument of the option if it takes one, and 1 if it does not. Finally, any file names that follow the options are left in @ARGV for the program to process. For any of the command lines shown above, getopt() would set %Options and @ARGV to

%Options = (l => 1,
            n => 1,
            a => '10:30')

@ARGV    = qw(foo bar)
Getopt::Std also has another interface:
$ok = getopts('a:ln', \%Options);
Like getopt(), getopts() takes a string and a hash reference. The string includes all the option letters: both those that take arguments and those that do not. Option letters that take an argument are marked with a trailing colon. Because getopts() has a list of all the valid options, it can do some simple error checking: getopts() returns false if there are invalid options on the command line, and true otherwise.
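A short sketch of how the return value of getopts() might be used. The command line assigned to @ARGV and the usage message are assumptions made so that the example is self-contained.

```perl
use strict;
use warnings;
use Getopt::Std;

# Stand-in command line (an assumption for this sketch).
local @ARGV = qw(-ln -a 10:30 foo bar);

my %Options;
# 'a:' takes an argument; 'l' and 'n' are plain switches.
getopts('a:ln', \%Options)
    or die "usage: pr [-l] [-n] [-a time] file ...\n";

print "landscape\n"          if $Options{l};
print "page numbers\n"       if $Options{n};
print "after $Options{a}\n"  if defined $Options{a};
print "files: @ARGV\n";      # non-option arguments are left in @ARGV
```

Had the command line contained an option outside a, l, and n, getopts() would have warned about it and returned false, and the program would have died with the usage message.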

Getopt::Long

If you need more power than Getopt::Std provides, consider using Getopt::Long. The name ::Long refers to an option style that uses two dashes and the complete option name, rather than a single character:
pr --landscape --numbers --after 10:30 foo bar
However, Getopt::Long is not merely Getopt::Std with a facelift. It provides a large—some would say bewildering—assortment of facilities for parsing the command line in different ways. In addition, Getopt::Long has evolved over the last ten years, reflecting changes in the underlying Perl language, changes in programming style, and changes in interface style. At the same time, it maintains backward compatibility with previous versions.

All this makes the programming interface to Getopt::Long large and complex. For a complete description, you should read the documentation that is contained within the module itself. Here, I'll give just a brief survey, illustrating the simpler features, and reflecting current style.

Basic Facilities

Conceptually, the interface to Getopt::Long is similar to that of Getopt::Std. It exports a routine named GetOptions(). GetOptions() takes a hash reference, where it stores the results, followed by a series of option specifiers, which tell it how to parse the command line. It returns true if there are no errors.
$ok = GetOptions(\%Options, "landscape", "numbers!", "after=s");
Each option specifier gives the name of an option, possibly followed by an argument specifier. The name will become a hash key. The argument specifier tells how to parse the argument to that option.

In the example above, landscape has no argument specifier. This means that it takes no argument: $Options{landscape} will be 1 if --landscape appears on the command line, and is left unset if it does not. numbers! also takes no argument, but the ! means that it may be explicitly negated by prefixing it with no on the command line, which sets $Options{numbers} to 0:

pr --nonumbers
after=s takes a string argument; the argument will become the value of $Options{after}. after:s also takes a string argument, but the colon means that the argument is optional. Other argument specifiers are =i for integer arguments and =f for floating point arguments.

A double dash on the command line terminates the option list.
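Putting the basic facilities together, a minimal sketch might look like this. The command line, the default for numbers, and the usage message are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Getopt::Long;

# Stand-in command line; the bare -- ends the option list, so the
# file named --foo is not mistaken for an option.
local @ARGV = qw(--landscape --nonumbers --after 10:30 -- --foo bar);

my %Options = (numbers => 1);   # hypothetical default: numbering on
GetOptions(\%Options, 'landscape', 'numbers!', 'after=s')
    or die "usage: pr [--landscape] [--[no]numbers] [--after time] file ...\n";

print "landscape: $Options{landscape}\n";
print "numbers:   $Options{numbers}\n";
print "after:     $Options{after}\n";
print "files:     @ARGV\n";     # the -- itself has been removed
```

Pre-loading %Options is one simple way to supply defaults: GetOptions() only overwrites the keys for options that actually appear.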

Hairy Stuff

If an argument specifier is suffixed with an @, then the option may be given multiple times on the command line, and the corresponding value in %Options becomes a reference to an array containing all the values supplied for that option. For example:
GetOptions(\%Options, "x=f@", "y=f@")
will parse
graph --x 1 --x 2 --x 3 --y 1 --y 4 --y 9
resulting in
%Options = (x => [1, 2, 3],
            y => [1, 4, 9])
Similarly, if an argument specifier is suffixed with a %, then the option takes key=value pairs, and the corresponding value in %Options becomes a reference to a hash of those pairs. So
GetOptions(\%Options, "define=s%")
will parse Stroustrup's example
cc --define sqrt=rand --define exit=abort hello.cc
resulting in
%Options = (define => { sqrt => 'rand',
                        exit => 'abort' })

The Kitchen Sink

The empty string is a valid option. It is written on the command line as a single dash, and results in the null key being entered into %Options with a value of 1. This form is conventionally used to specify that the program should take input from STDIN, rather than from a named file:
cat - 
You don't have to store all the options in %Options. Each option can have its own linkage specification, which may be a scalar ref, an array ref, a hash ref, or a code ref. For scalar, array, and hash refs, the option is stored in the referenced variable. If the linkage specification is a code ref, the option isn't stored anywhere; instead, the option name and value are passed to the referenced subroutine.
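Here is a sketch of the four kinds of linkage, one per option. The variable names and the --verbose option are hypothetical; the point is only to show where each kind of reference goes.

```perl
use strict;
use warnings;
use Getopt::Long;

# Stand-in command line (an assumption for this sketch).
local @ARGV = qw(--after 10:30 --x 1 --x 2 --define sqrt=rand
                 --verbose --verbose foo);

my $after;            # scalar ref: receives the option value
my @x;                # array ref: one element per occurrence of --x
my %define;           # hash ref: receives key=value pairs
my $verbosity = 0;    # maintained by the subroutine below

GetOptions(
    'after=s'  => \$after,
    'x=f'      => \@x,
    'define=s' => \%define,
    # code ref: called with the option name and value each time
    'verbose'  => sub { my ($name, $value) = @_; $verbosity += $value },
) or die "bad command line\n";

print "after:     $after\n";
print "x:         @x\n";
print "define:    sqrt => $define{sqrt}\n";
print "verbosity: $verbosity\n";
print "files:     @ARGV\n";
```

When the destination is an array or hash reference, the @ and % suffixes on the specifier aren't needed; GetOptions() infers them from the type of the reference.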

Option names can have aliases, and can be abbreviated to uniqueness. You can configure Getopt::Long for compatibility with GNU or POSIX. You can control case sensitivity. You can cluster options. You can pass options through to called programs. You can intersperse options and non-option arguments on the command line. This allows different files to be processed with different options:

pr --numbers foo --nonumbers bar
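Aliases, abbreviation, and clustering can be sketched as follows. The single-letter aliases and the command line are assumptions; clustering is switched on with the bundling configuration option.

```perl
use strict;
use warnings;
use Getopt::Long;

# Allow single-letter options to be clustered behind one dash.
Getopt::Long::Configure('bundling');

# Stand-in command line: -ln is a cluster, --aft abbreviates --after.
local @ARGV = qw(-ln --aft 10:30 foo bar);

my %Options;
# The name before each | is the primary name and becomes the hash key.
GetOptions(\%Options, 'landscape|l', 'numbers|n', 'after|a=s')
    or die "bad command line\n";

print "$_ => $Options{$_}\n" for sort keys %Options;
print "files: @ARGV\n";
```

Whichever alias the user types, the result is stored under the primary name, so the rest of the program only ever looks at landscape, numbers, and after.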
Finally, you can specify a code ref to process arguments that aren't options. This allows GetOptions() to process the entire command line, and potentially reduces your program to a single
GetOptions(...);
call, plus subroutines.
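A sketch of that style, applied to the pr --numbers foo --nonumbers bar command line. The print_file() subroutine and the @jobs array are hypothetical stand-ins for the real work of printing; the special option name <> is what routes non-option arguments to a subroutine. This relies on Getopt::Long's default behavior of allowing options and other arguments to be mixed.

```perl
use strict;
use warnings;
use Getopt::Long;

# Stand-in command line (an assumption for this sketch).
local @ARGV = qw(--numbers foo --nonumbers bar);

my $numbers = 0;      # current setting, updated as options are seen
my @jobs;             # a real pr would print; this sketch just records

# Hypothetical stand-in for printing one file with the current options.
sub print_file {
    my ($file) = @_;
    push @jobs, join(' ', $file, $numbers ? 'numbered' : 'plain');
}

GetOptions(
    'numbers!' => \$numbers,
    '<>'       => \&print_file,   # called for each non-option argument
) or die "bad command line\n";

print "$_\n" for @jobs;
```

Because arguments are handled in the order they appear, foo is processed while numbering is on and bar after it has been turned off.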

The Rest of the Pack

Getopt::Std and Getopt::Long are both supplied with the standard Perl distribution. There are currently six other Getopt:: modules available on CPAN. Here is a quick tour.

Getopt::Simple

Getopt::Simple describes itself as a simple wrapper around Getopt::Long. However, nothing that supports the functionality of Getopt::Long can be entirely simple. In fact, Getopt::Simple is an object-oriented wrapper around Getopt::Long. Rather than coding option descriptions into strings, Getopt::Simple lays them out in hash tables:
$descriptions = { landscape => { type => ''   },
		  numbers   => { type => ''   },
		  after	    => { type => '=s' }  }
getOptions() is invoked as a method on a Getopt::Simple object:
$options = new Getopt::Simple;
$options->getOptions($descriptions, 
		     "Usage: pr -landscape -numbers -after time");
and options are retrieved through the $options object:
$options->{switch}{landscape} and ...
$options->{switch}{after}     and ...

Getopt::Tabular

Getopt::Tabular uses a table to describe options, and then parses the command line through a procedural interface:
@options = (['-landscape', 'boolean', 0, \$landscape,
		'print in landscape orientation'],

	    ['-numbers'  , 'boolean', 0, \$numbers  ,
		'print page numbers'		],

	    ['-after'    , 'string' , 1, \$time     ,
		 'print after time'		],      );
         
GetOptions(\@options, \@ARGV) or exit 1;
Each line in the table describes a single option, and specifies the option name, type, number of arguments, action to take, and help text. The simplest action is to set a scalar; Getopt::Tabular can also collect arguments from the command line and assign them to an array, or pass them to a subroutine.

If anything goes wrong, GetOptions() automatically formats an error message, based on the help text supplied in @options. Getopt::Tabular also supplies an entry point called SpoofGetOptions().

SpoofGetOptions(\@options, \@ARGV)
parses the command line and checks it for errors, but doesn't take any action. This is particularly useful for programs that use subroutines to process arguments, because subroutines can do expensive or irreversible things.

Getopt::Mixed

Getopt::Mixed supports both long and short options: long because they are easy to remember; short because they are easy to type. Long options are introduced on the command line with two dashes; short options with one:
pr --landscape -a 12:00 foo bar
The programming interface is similar to Getopt::Long:
Getopt::Mixed::getOptions(@option_descriptions)
There is also an iterative form, which allows the program to process options one at a time:
Getopt::Mixed::init(@option_descriptions);
while (($option, $value) = Getopt::Mixed::nextOption()) { ... }
Getopt::Mixed::cleanup();
The results are stored in global variables. Given the command line shown above, Getopt::Mixed would set
$opt_landscape = 1
$opt_a         = '12:00'
Non-option arguments are left in @ARGV.

Getopt::Declare

Getopt::Declare doesn't parse anything directly. Rather, it builds and runs a parser. Options and their arguments are laid out in a single specification string:
$spec = q(-l		Print in landscape mode { $landscape = 1 }
	  -n		Print page numbers      { $numbers   = 1 }
	  -a <time:s>	Print after time	{ Queue($time)   });
The string describes each option, along with help text and a BLOCK to be executed when the option is found. Getopt::Declare::new creates a parser object from a specification string:
$parser = new Getopt::Declare $spec;
As written, this builds a parser and runs it on the command line; with additional arguments, it can parse strings or configuration files. Options and their values can be retrieved from $parser, but this is typically unnecessary, because the BLOCKs in the specification string contain arbitrary Perl code. There are powerful facilities for specifying and checking option syntax and arguments. Options can be required, and groups of options can be made mutually exclusive. Usage lines are automatically generated from the help text. The parser object can be saved and later run on different input.

Getopt::EvaP

Getopt::EvaP is broadly similar to Getopt::Simple and Getopt::Declare. Options and help text are specified in tables. A call to EvaP() parses the command line according to the tables and returns the results in a %options hash:
EvaP \@option_specs, \@help_text, \%options
Perhaps the most interesting feature of Getopt::EvaP is that it has been implemented for Perl, Perl/Tk, Tcl and C. If you are developing in multiple languages, EvaP can provide a consistent user interface across all your applications.

Getopt::Regex

Getopt::Regex takes a different approach to managing the potential complexity of command line syntax. Rather than implementing sophisticated parsing facilities of its own, it relies on the Perl regular expression engine.
GetOptions(\@ARGV, [$regex, \$scalar  , $takesarg], 
  		   [$regex, sub {...} , $takesarg], ...);
For each option, the user passes an array ref. The first element is a regular expression, the second is either a scalar ref or a code ref, and the third indicates whether the option takes an argument. An element of @ARGV is recognized as an option if it matches a $regex. When an option is found, GetOptions() sets $scalar or calls sub {...}, as appropriate. If the option takes an argument, the argument is assigned to $scalar, or passed to sub {...}.

The Importance of Being Lazy

One of the principal virtues of a programmer is Laziness, and these modules provide a wonderful opportunity to be Lazy. Before parsing your own command line, look to see if there isn't a Getopt:: module that will do what you need. If there is, use it. If there isn't, encapsulate your parsing code in a new Getopt:: module, and consider submitting it to CPAN. Then other programmers can be Lazy, even if you can't.

Notes

Standard
There is perhaps a fine line between having non-standard interfaces and having 8 different standard interfaces.

Steven W. McDougall / swmcd@world.std.com / 1999 May