Big fleas have little fleas
On their backs to bite 'em,
And little fleas have littler still
And so, ad infinitum.

Finding your files with File::Find

To see all the files in a directory, you can use a command like
%ls lib
Bar     Foo.pm  SCCS
%
To see all the files in a directory, and all its subdirectories, and all their subdirectories, all the way down, you can use a command like
%ls -R lib
Bar     Foo.pm  SCCS
lib/Bar:
Baz.pm  SCCS

lib/Bar/SCCS:
Baz.pm,v

lib/SCCS:
Foo.pm,v
%
This is called a recursive subdirectory search, and it is a very powerful way to operate on directory hierarchies.

find(1)

If you want to operate on some files in a hierarchy but not others, you can use a command like find(1). find(1) has many options that allow you to include or exclude files from your search. However, it has several drawbacks:

Stop me before I code again

Perl is an obvious language for writing such a program, and once you start coding Perl, you may wonder if you really need find(1) at all. After all, Perl supports recursion. How hard could it be?
sub Find
{
    my $dir = shift;
    opendir(DIR, $dir);
    my @files = readdir(DIR);
    for my $file (@files)
    {
	-d $file and Find("$dir/$file");
	-f $file and 
STOP! Don't write another line. It's already been done. It's called File::Find, it comes with the standard Perl distribution, AND when you use File::Find, you get these two BONUS LINES at NO EXTRA CHARGE:
        $file eq '.'  and next;
        $file eq '..' and next;

File::Find

To use File::Find, write
use File::Find;
find(\&Wanted, $dir);
find() does a recursive subdirectory search of $dir. It calls Wanted() once for each file and directory in $dir, including $dir itself. You can also pass a list of directories, and find() will search all of them:
find(\&Wanted, @dirs);
You get to write Wanted(). It is an arbitrary subroutine, and can do whatever you need. When Wanted() is called, $_ holds the name of the current file. find() relies on $_, so if you change it, you must restore it before returning from Wanted(). If Wanted() sets $File::Find::prune to a true value while visiting a directory, find() will not descend into that directory.
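As a minimal sketch of what Wanted() gets to see, the script below builds a throwaway copy of the lib tree with File::Temp and records every path that find() visits. The helper name list_tree() and the temporary tree are assumptions made for illustration. Besides $_, File::Find also sets $File::Find::dir to the directory being visited and $File::Find::name to the full path of the current file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Build a throwaway tree that mirrors the lib example in this article.
my $top = tempdir(CLEANUP => 1);
make_path("$top/lib/Bar/SCCS", "$top/lib/SCCS");
for my $file ("lib/Foo.pm", "lib/Bar/Baz.pm",
              "lib/SCCS/Foo.pm,v", "lib/Bar/SCCS/Baz.pm,v")
{
    open(my $fh, '>', "$top/$file") or die "can't create $top/$file: $!";
    close($fh);
}

# Hypothetical helper: return the path, relative to $dir,
# of everything find() visits, in sorted order.
sub list_tree
{
    my $dir = shift;
    my @seen;
    find(sub {
             # $_ is the bare name; $File::Find::name is the full path.
             # Work on a copy so that $_ is left untouched.
             my $path = $File::Find::name;
             $path =~ s/^\Q$dir\E\/?//;
             push @seen, $path eq '' ? '.' : $path;
         }, $dir);
    @seen = sort @seen;
    return @seen;
}

print "$_\n" for list_tree("$top/lib");
```

The callback copies $File::Find::name into a lexical before editing it, which avoids the need to save and restore $_.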

Wanted()

Typically, Wanted() begins by deciding whether it wants to operate on the current file. Regular expression matches on $_ do this concisely:
sub Wanted
{
    # only operate on Perl modules
    /\.pm$/ or return;	
    ...
}

sub Wanted
{
    # Don't descend into SCCS directories
    # (anchor the pattern so that names merely containing SCCS are not pruned)
    /^SCCS$/ and $File::Find::prune = 1;
    ...
}
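
The two tests can be combined in a single Wanted(). The sketch below assumes a throwaway tree built with File::Temp and a hypothetical helper, perl_modules(), that returns the .pm files found outside SCCS directories. It is an illustration, not the only way to arrange the tests.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Throwaway tree: two modules, plus SCCS directories that should be skipped.
my $top = tempdir(CLEANUP => 1);
make_path("$top/lib/Bar/SCCS", "$top/lib/SCCS");
for my $file ("lib/Foo.pm", "lib/Bar/Baz.pm",
              "lib/SCCS/Old.pm", "lib/Bar/SCCS/Baz.pm,v")
{
    open(my $fh, '>', "$top/$file") or die "can't create $top/$file: $!";
    close($fh);
}

# Hypothetical helper: list the .pm files under $dir,
# without descending into SCCS directories.
sub perl_modules
{
    my $dir = shift;
    my @modules;
    my $wanted = sub {
        # Prune first, so find() never looks inside an SCCS directory
        if (-d $_ and $_ eq 'SCCS')
        {
            $File::Find::prune = 1;
            return;
        }
        # Only operate on Perl modules
        -f $_ and /\.pm$/ or return;
        push @modules, $File::Find::name;
    };
    find($wanted, $dir);
    @modules = sort @modules;
    return @modules;
}

print "$_\n" for perl_modules("$top/lib");
```

Because lib/SCCS is pruned, lib/SCCS/Old.pm is never visited, even though its name matches the .pm test.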

finddepth()

File::Find has an alternate entry point called finddepth(). find() and finddepth() both traverse the directory hierarchy depth-first. The difference is that find() calls Wanted() on subdirectories on the way down, and finddepth() calls Wanted() on subdirectories on the way back up. This is easier to understand with an example. If we run
find(sub { print "$_\n" }, 'lib')
on the directory hierarchy shown at the beginning of this article, the output is
.
SCCS
Foo.pm,v
Bar
Baz.pm
SCCS
Baz.pm,v
Foo.pm
Note that find() calls the sub on SCCS before Foo.pm,v. On the other hand, if we run
finddepth(sub { print "$_\n" }, 'lib')
on the same hierarchy, the output is
Foo.pm,v
SCCS
Baz.pm
Baz.pm,v
SCCS
Bar
Foo.pm
.
and we see that finddepth() calls the sub on SCCS after Foo.pm,v. $File::Find::prune doesn't work in finddepth(), because finddepth() has already descended into the subdirectory before Wanted() has a chance to set it.
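
One case where the bottom-up order matters is deleting a directory tree: rmdir() only succeeds on an empty directory, so each directory has to be visited after its contents. The sketch below assumes a throwaway tree created with File::Temp. The name remove_bottom_up() is hypothetical, and the no_chdir option is used so that full paths can be removed without find() changing into directories that are about to disappear. In practice, File::Path already provides remove_tree() for this job.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Throwaway tree to delete.
my $top = tempdir(CLEANUP => 1);
make_path("$top/lib/Bar/SCCS", "$top/lib/SCCS");
for my $file ("lib/Foo.pm", "lib/Bar/Baz.pm",
              "lib/SCCS/Foo.pm,v", "lib/Bar/SCCS/Baz.pm,v")
{
    open(my $fh, '>', "$top/$file") or die "can't create $top/$file: $!";
    close($fh);
}

# Hypothetical helper: delete $dir and everything below it.
# Returns the paths in the order they were removed.
sub remove_bottom_up
{
    my $dir = shift;
    my @removed;
    my $wanted = sub {
        # With no_chdir set, $File::Find::name is usable from anywhere
        my $path = $File::Find::name;
        my $ok = (-d $path && !-l $path) ? rmdir($path) : unlink($path);
        $ok or warn "can't remove $path: $!\n";
        $ok and push @removed, $path;
    };
    # finddepth() calls $wanted on a directory after its contents,
    # so the directory is already empty when rmdir() runs
    finddepth({ wanted => $wanted, no_chdir => 1 }, $dir);
    return @removed;
}

my @removed = remove_bottom_up("$top/lib");
print "$_\n" for @removed;
```

With find() instead of finddepth(), the rmdir() calls would come first and fail, because each directory would still contain its files.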

Kibo

There was a time when you could read, actually read, Usenet. That time is long past. Usenet has grown to thousands of newsgroups carrying millions of articles per day. Accessing Usenet requires powerful tools, and even then, the best anyone can hope for is to see a tiny fraction of the traffic.

Many people access Usenet through a newsreader. Newsreaders are good if they do what you want; they can be slow and clumsy if they do not. If you can't find a newsreader that does what you want, you can use File::Find to scan your news spool directly. Here's an example:

#!/usr/local/bin/perl
use strict;
use File::Find;

my($Group, $Text) = @ARGV;
my $Spool = "/var/spool/news";	# or wherever your newsspool lives
$| = 1;				# so we can see it run
find(\&Kibo, "$Spool/$Group");

sub Kibo
{
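    -d and print "$_\n";            # directory: print the group name
    -f and /^\d+$/ or return;       # articles are plain files with numeric names
    print "$_\r";                   # article number, overwritten in place

    # Read one line at a time rather than slurping the whole article
    open(my $article, '<', $_) or return;
    while (my $line = <$article>)
    {
        $line =~ /$Text/o and print $line;
    }
    close($article);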
}
This program takes two command line arguments: a newsgroup and a string. It reads all the articles in the newsgroup, and all its subgroups, and prints any lines that contain the string. The string is interpolated into a regular expression, so any pattern metacharacters in it keep their special meaning. The program also prints the newsgroup names and article numbers as it goes; this is mainly so the user can tell that something is happening. Depending on the size of the newsgroup, the program can take a long time to run.

The filename filtering in Kibo() is typical. The -d and -f tests sort out files and directories. The articles in a newsspool have numeric filenames; the /^\d+$/ test skips any extraneous files that may be lying around.
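
To see the filtering at work without a real news spool, here is a rough sketch that runs the same tests against a fake spool built with File::Temp. The helper grep_spool(), the group name alt/test, and the article contents are all invented for illustration. Unlike Kibo(), this variant collects its matches instead of printing them, and it uses \Q...\E so that the search text is matched literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Fake spool: one group with a subgroup, two articles, one stray file.
my $spool = tempdir(CLEANUP => 1);
make_path("$spool/alt/test/sub");
my %content = (
    "alt/test/1"         => "Subject: hello\nKibo was here\n",
    "alt/test/sub/2"     => "Subject: other\nnothing to see\n",
    "alt/test/.overview" => "Kibo index data\n",
);
for my $file (sort keys %content)
{
    open(my $fh, '>', "$spool/$file") or die "can't create $spool/$file: $!";
    print $fh $content{$file};
    close($fh) or die "can't write $spool/$file: $!";
}

# Hypothetical helper: return "path: line" for every
# article line under $dir that contains $text.
sub grep_spool
{
    my ($dir, $text) = @_;
    my @hits;
    find(sub {
             -f $_ and /^\d+$/ or return;    # numeric filenames only
             open(my $article, '<', $_) or return;
             while (my $line = <$article>)
             {
                 # \Q...\E matches $text literally, not as a pattern
                 $line =~ /\Q$text\E/
                     and push @hits, "$File::Find::name: $line";
             }
             close($article);
         }, $dir);
    return @hits;
}

print grep_spool("$spool/alt/test", "Kibo");
```

The stray .overview file contains the search text, but it is never opened, because its name fails the /^\d+$/ test.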

If you actually want to write or use a program like this to scan Usenet, check the scripts/news/ directory on CPAN; it contains several working examples. Remember, Laziness is one of the principal virtues of a programmer.


NOTES

Kibo
James "Kibo" Parry was one of the first people to employ this technique, regularly scanning his entire newsfeed for his own user name.

Steven W. McDougall / resume / swmcd@theworld.com / 1999 June