Checking links with LinkCheck

I'm not an early adopter. I started hearing about this World Wide Web thing in 1993, but I didn't figure out what it was until 1994. My ISP started hosting web pages in 1995, but I didn't write one until 1996, and I didn't write a second one until 1997.

Web pages grow like weeds in an untended garden. My web site now comprises 82 pages, with 1176 links. It's time to do some gardening. In particular, it's time to look for broken links, and either fix or remove them.

I'm certainly not going to go crawling through scores of pages by hand, clicking on links to see if they work. We need a program to do this. Yahoo lists programs that check web pages. I looked at some of these, but I didn't find any that were

so I decided to write my own.

Features

The first thing we need for a program is a name. We'll call this one linkcheck. To check the links on a page, we write
linkcheck http://my.isp.com/page.html

This will give us a report like

Checked 1 pages, 49 links          
Found 0 broken links

-r

As shown, linkcheck checks all the links on one page, but we want to check all the pages on a site. We can do this by recursively following links that we find to other web pages, and then checking those pages
linkcheck -r http://my.isp.com/page.html
Checked 144 pages, 1025 links          
Found 3 broken links

If we follow every link that we find, we're liable to end up spidering the entire web. To avoid this, we only follow links to pages on our own site: my.isp.com.

-o

We don't follow links to offsite pages, but there remains a question of whether to check links to offsite pages. If we want to do this, we specify the -o flag
linkcheck -o -r http://my.isp.com/page.html
Checked 144 pages, 1131 links          
Found 3 broken links

-v verbosity

If we find any broken links, we'd probably like to know what they are. The -v verbosity flag controls the amount of output that we get
-v 0
show count of broken links (default)
-v 1
also list broken links
-v 2
also list checked pages
-v 3
also list checked links

-t twiddle

Web pages can take a long time to download. While we're waiting, we'd like to see some output, so that we know the program is doing something, and so that we don't get too bored. We could use the -v flag, but we might be sending output to a file or pipe. Instead, we provide a twiddle
-t 0
none (default)
-t 1
spinner: | / - \
-t 2
progress report: "$Pages pages, $Links links, $Broken broken\r"

Output is written to stdout, while the twiddle displays on stderr. This allows us to redirect output and still see the twiddle. It also ensures that the twiddle is unbuffered, so that it displays in real time.

Algorithm

Here is a rough outline of the steps required to check the links on a page. Given a URL, we must

To make this into a usable program, we must also

Writing all this from the ground up would be a big job. Fortunately, we don't have to. Most of the heavy lifting has already been done by others, and made available to us in modules. Here are the modules used by linkcheck

Using these modules, we can bolt together the completed application with only a few hundred lines of code. In the remainder of this article, we'll see how to do this.

Modules

First, we'll review the modules

Getopt::Std

Getopt::Std parses command line options. See Parsing Command Line Options with GetOpt:: for further discussion.
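
As a rough sketch of how the flags described above might be parsed: getopt('vt', ...) treats -v and -t as switches that take a value, and accepts any other switch, such as -r or -o, as a boolean. The command line here is simulated by assigning to @ARGV, and the explicit defaults are an assumption made for illustration, not code taken from linkcheck.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate: linkcheck -r -v 2 -t 1 http://my.isp.com/page.html
@ARGV = qw(-r -v 2 -t 1 http://my.isp.com/page.html);

my %Options;
getopt('vt', \%Options);    # -v and -t take a value; other switches are booleans

# Assumed defaults, matching the "(default)" entries in the option lists
$Options{v} = 0 unless defined $Options{v};
$Options{t} = 0 unless defined $Options{t};

print "recursive: ", ($Options{r} ? "yes" : "no"), "\n";
print "verbosity: $Options{v}\n";
print "twiddle:   $Options{t}\n";
print "pages:     @ARGV\n";     # whatever getopt left behind
```

After getopt returns, only the non-option arguments remain in @ARGV, which is why the main program can pass @ARGV straight to CheckPages().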

URI

URI manages URIs: each URI object represents a single URI. URI has many methods for constructing, manipulating, and analyzing URIs, but we need only a few. To create a URI object, we write
$uri = new URI 'http://my.isp.com/page1.html#section1';

We can resolve relative links with the new_abs constructor

$uri2 = new_abs URI 'page2.html', $uri;  # http://my.isp.com/page2.html

Accessors extract the components of a URI

$uri->scheme;		# http
$uri->authority;	# my.isp.com
$uri->fragment;		# section1

Passing an argument to an accessor sets that component. Empty components are represented as undef.

$uri->fragment('section2');	# http://my.isp.com/page1.html#section2
$uri->fragment(undef);		# http://my.isp.com/page1.html

The as_string() method returns the string representation of a URI object. as_string() is overloaded onto the stringize ("") operator; this means that we can use a URI object almost anywhere that we can use a string

print "$uri\n";
$Visited{$uri} = 1;
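
The fragments above can be collected into one small script. This is only a sketch: it assumes the URI module is installed, and it uses the method-call form URI->new(...), which is equivalent to the indirect-object form new URI ... used in this article.

```perl
use strict;
use warnings;
use URI;

my $uri  = URI->new('http://my.isp.com/page1.html#section1');
my $uri2 = URI->new_abs('page2.html', $uri);    # resolve a relative link

print "scheme:    ", $uri->scheme,    "\n";
print "authority: ", $uri->authority, "\n";
print "fragment:  ", $uri->fragment,  "\n";
print "resolved:  $uri2\n";

$uri->fragment(undef);          # drop the fragment
print "stripped:  $uri\n";

my %Visited;
$Visited{$uri} = 1;             # the hash key is the stringified URI
print "visited\n" if $Visited{'http://my.isp.com/page1.html'};
```

Because hash keys are always strings, storing a URI object as a key relies on the stringification overload described above.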

LWP::UserAgent

LWP is the Library for WWW access in Perl. We use it to retrieve web pages. Perhaps the simplest way to get a web page is with the LWP::Simple module
use LWP::Simple;
$content = get($uri);

The get() function returns the contents of the web page, or undef on failure. However, we need a bit more control than that, so we'll use the LWP::UserAgent module instead.

A user agent is any kind of HTTP client. LWP::UserAgent implements an HTTP client in Perl. To retrieve a web page, we create an LWP::UserAgent object, send an HTTP request, and receive the HTTP response.

$ua       = new LWP::UserAgent;
$request  = new HTTP::Request GET => $uri;
$response = $ua->request($request);

$response contains the contents of the web page

$content  = $response->content;

If we only need the HTTP headers—for example, to check the existence or the modification date of a page—we can make a HEAD request, instead

$request  = new HTTP::Request HEAD => $uri;

The request() method automatically handles redirects. We can recover the URL from which the page was ultimately retrieved as

$uri = $response->request->uri;

HTML::Parser

Once we have a web page, we want to find all the links on it. HTML::Parser parses web pages. We don't use HTML::Parser directly; rather, we create a subclass of it
use HTML::Parser;

package HTML::Parser::Links;

use base qw(HTML::Parser);

To parse a web page, we create an object of our subclass and pass the contents of the page to the parse method

$parser = new HTML::Parser::Links;
$parser->parse($content);
$parser->eof;

parse invokes methods in our subclass as callbacks. We only need one callback

sub start
{
    my($parser, $tag, $attr, $attrseq, $origtext) = @_;

parse calls start whenever it identifies the opening tag of an HTML element. The parameters are

$parser
the HTML::Parser::Links object
$tag
the name of the HTML element, e.g. h1, a, strong
%$attr
a hash of the attribute name=value pairs in the tag
@$attrseq
a list of the attributes in the tag, in their original order
$origtext
the original text of the tag

We only care about a few tags and attributes. If we find a base tag, we capture the URL so that we can resolve relative links on that page

$tag eq 'base' and
    $base = $attr->{href};

When we find an a (anchor) tag, we capture either the href (for links)

$tag eq 'a' and $attr->{href} and 
    $href = $attr->{href};

or the name (for fragments)

$tag eq 'a' and $attr->{name} and
    $name = $attr->{name};

Pod::Usage

It is common practice to embed the documentation for a Perl program within the program itself, in POD format. Pod::Usage parses any POD text that it finds in the program source and prints it. This makes it easy to add usage and help facilities to a program.
pod2usage();		   # print synopsis
pod2usage(VERBOSE=>1);  # print synopsis and options
pod2usage(VERBOSE=>2);  # print entire man page

pod2usage is typically called when there are errors on the command line, so it exits after printing the POD.

Packages

Modules and packages are related, but distinct, concepts. A module is a file that contains Perl code. A package is a namespace that contains Perl subroutines or variables.

Module writers typically put their code into a package that is named after the module, to promote encapsulation and avoid name collisions. Conversely, package writers may put their code into a module, to make it available to other programs.

However, we can also embed packages directly in our program, simply by adding a package statement

package Spinner;

We use packages in our program to

If we were writing modules, we would need to

However, our packages are visible only within our program, so we needn't be so formal: we can create and use packages at our convenience. Here are the packages that we use within linkcheck

Spinner

The -t 1 option displays a spinner. This is a 1-character animation, constructed by cyclically printing the characters
| / - \

in the same location on the screen. Here is the complete package

package Spinner;

use vars qw($N @Spin);

@Spin = ('|', '/', '-', '\\');

sub Spin
{
    print STDERR $Spin[$N++], "\r";
    $N==4 and $N=0;
}

There's not much to it. $N, @Spin, and &Spin are all contained in the Spinner:: namespace. To advance the spinner, we call

Spinner::Spin();

It is tempting to use file-scoped lexicals instead of package variables

package Spinner;

my $N;
my @Spin = ('|', '/', '-', '\\');

If Spinner were a module, this would be fine; however, in our case it wouldn't actually provide any encapsulation. File scoping doesn't respect package declarations, so any file-scoped lexicals would share the same namespace, and be subject to name collisions, with every other file-scoped lexical in the entire program.

HTML::Parser::Links

HTML::Parser::Links is our subclass of HTML::Parser. The code fragments shown above illustrate the base class interface. In our subclass, we have additional instance data, to represent the parsed HTML page, and accessors to return information about the page.

The new method is our constructor.

sub new
{
    my($class, $base) = @_;

    my $parser = new HTML::Parser;
    $parser->{base }    = $base;
    $parser->{links}    = [];
    $parser->{fragment} = {};

    bless $parser, $class
}

To create an HTML::Parser::Links object, we

Here is the complete start method

sub start
{
    my($parser, $tag, $attr, $attrseq, $origtext) = @_;

    $tag eq 'base' and
        $parser->{base} = $attr->{href};

    $tag eq 'a' and $attr->{href} and do
    {
        my $base = $parser->{base};
        my $href = $attr->{href};
        my $uri  = new_abs URI $href, $base;
        push @{$parser->{links}}, $uri;
    };

    $tag eq 'a' and $attr->{name} and do
    {
        my $name = $attr->{name};
        $parser->{fragment}{$name} = 1;
    };
}

We only care about base and a tags. If we find a base element, we save the href so that we can resolve relative links. When we find a link, we create a new URI object and add it to the list of links. Finally, if we find a fragment, we add it to the fragment hash.

We have two accessors.

$parser->links()

returns a list of all the links on the page.

$parser->check_fragment($fragment)

returns true iff $fragment exists on the page.
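
The article does not show the bodies of these two accessors. A minimal sketch, assuming only the instance data set up in new (the links array and the fragment hash), might look like this. The hand-built object in main is a stand-in for a parser that has already parsed a page, so the example runs without HTML::Parser.

```perl
use strict;
use warnings;

package HTML::Parser::Links;

# Return the list of links collected by start()
sub links
{
    my $parser = shift;
    @{$parser->{links}}
}

# Return true iff an anchor with this name was seen on the page
sub check_fragment
{
    my($parser, $fragment) = @_;
    exists $parser->{fragment}{$fragment}
}

package main;

# Stand-in for a parser object after parsing (assumed data)
my $parser = bless {
    links    => ['http://my.isp.com/page2.html'],
    fragment => { section1 => 1 },
}, 'HTML::Parser::Links';

print join("\n", $parser->links), "\n";
print $parser->check_fragment('section1') ? "found\n" : "missing\n";
print $parser->check_fragment('section9') ? "found\n" : "missing\n";
```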

Page

The Page package retrieves and parses web pages. The web is multiply connected: there may be many links to a single web page. However, downloading pages over the network takes time, so we don't want to download any page more than once.

Page caches web pages in %Page::Content. The URL is the hash key, and the page content is the value. The first time we request a page, Page downloads it and caches the contents; any subsequent requests for the same page are satisfied from the cache, with no additional network activity.

The Page package also parses web pages. Parsing a page doesn't require network I/O, but it still takes time, and if we create and run a new parser for every fragment that we have to check, that time could be significant.

To avoid this, Page caches parsers in %Page::Parser. The hash key is the page URL, and the value is an HTML::Parser::Links object.

Here is the external interface for the Page package.

$page    = new Page $uri;
$uri     = $page->uri;
$links   = $page->links;
$content = get   $page;
$parser  = parse $page;
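
The caching idea behind Page can be sketched without any network access. In the sketch below, the package name PageCacheDemo, the fetch() stub, and the $Downloads counter are all hypothetical; only the idea of a hash keyed by URL, like %Page::Content, comes from the description above.

```perl
use strict;
use warnings;

package PageCacheDemo;

our %Content;           # URL => page content, in the manner of %Page::Content
our $Downloads = 0;     # counts simulated downloads

# Hypothetical stand-in for the real download; no network access
sub fetch
{
    my $url = shift;
    $Downloads++;
    "<html>contents of $url</html>"
}

# Return the page content, downloading it only on the first request
sub get
{
    my $url = shift;
    exists $Content{$url} or $Content{$url} = fetch($url);
    $Content{$url}
}

package main;

PageCacheDemo::get('http://my.isp.com/page.html') for 1 .. 3;
PageCacheDemo::get('http://my.isp.com/other.html');
print "downloads: $PageCacheDemo::Downloads\n";    # 2, not 4
```

Testing with exists rather than for a true value means that a failed download, stored as undef, would also be remembered and not retried. Whether linkcheck itself does this is not stated in the article.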

Link

The Link package checks the validity of a single link. Its external interface is very simple
$link = new Link $uri;
$ok   = $link->check;

Like the Page package, Link has some optimizations to avoid unnecessary operations. Checking links breaks down into two cases. If the link has a fragment

http://my.isp.com/page.html#section

then we have to download the entire page, parse it, and then verify that the fragment exists in the page. If the link has no fragment

http://my.isp.com/page.html

then we don't have to parse the page; in fact, we don't even have to download it: a HEAD request will tell us whether the page exists, and that's all we care about.

Internally, the check() method calls check_fragment() or check_base(), respectively, to handle these two cases. check_fragment() uses the Page package to download and parse the page, then it checks to see if the fragment exists in the page. check_base() issues a HEAD request directly to see if the page exists.

In either case, check() caches the results in %Link::Check, so we never have to check any link more than once.
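
The dispatch and caching described above might be sketched as follows. The package name LinkDemo and the %Calls counter are hypothetical, and both checkers are stubs that always succeed: the real check_fragment() would parse the page, and the real check_base() would issue a HEAD request. The sketch also tests for a fragment with a simple pattern match so that it has no module dependencies; with a URI object one would ask $uri->fragment.

```perl
use strict;
use warnings;

package LinkDemo;

our %Check;                                 # link => result, like %Link::Check
our %Calls = (fragment => 0, base => 0);    # how often each stub really ran

# Stubs standing in for the real checks
sub check_fragment { $Calls{fragment}++; 1 }
sub check_base     { $Calls{base}++;     1 }

sub check
{
    my $link = shift;
    exists $Check{$link} and return $Check{$link};

    # A '#' in the link means there is a fragment to verify
    $Check{$link} = $link =~ /#/ ? check_fragment($link)
                                 : check_base($link);
}

package main;

LinkDemo::check('http://my.isp.com/page.html')         for 1 .. 2;
LinkDemo::check('http://my.isp.com/page.html#section') for 1 .. 2;
print "HEAD checks: $LinkDemo::Calls{base}, parses: $LinkDemo::Calls{fragment}\n";
```

Each link is checked twice, but each stub runs only once; the second call is answered from the cache.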

Program

With all the infrastructure provided by the modules and packages, we can complete linkcheck in about 100 lines of code. Here is the main program
package main;

my %Options;
my %Checked;
my($Scheme, $Authority);
my($Pages, $Links, $Broken) = (0, 0, 0);

getopt('vt', \%Options);
Help();
CheckPages(@ARGV);
Summary();

Globals

We declare our globals as file-scoped lexicals: this is the main program; the file scope properly belongs to it. %Options holds command line options. %Checked is a hash of checked URLs; we use it to avoid infinite recursion if there is a cycle of links on our web site. $Scheme and $Authority record the scheme and authority of the page named on the command line; we use $Authority to identify onsite links. $Pages, $Links and $Broken provide counts for Progress() and Summary().

CheckPages

After parsing command line options, @ARGV contains a list of pages to check. CheckPages() creates a URI object for each page, and calls CheckPage() on it.
sub CheckPages
{
    my @pages = @_;
    my @URIs  = map { new URI $_ } @pages;

    for my $uri (@URIs)
    {
        $Scheme    = $uri->scheme;
        $Authority = $uri->authority;
        CheckPage($uri);
    }
}

CheckPage

CheckPage() checks a single page.
sub CheckPage
{
    my $uri = shift;
    
    $Checked{$uri} and return;
    $Checked{$uri} = 1;
    $Pages++;
    Twiddle();
    print "PAGE $uri\n" if $Options{v} > 1;

    my $page  = new Page $uri;
    my $links = $page->links;
    defined $links or
        die "Can't get $uri\n";

    CheckLinks($page, $links);
}

After some housekeeping, it creates a new Page object, gets all the links on the page, and calls CheckLinks().

linkcheck checks for broken links, but the pages that the user specifies on the command line have to exist. If we can't download one, we die.

CheckLinks

CheckLinks() checks the links on a page.
sub CheckLinks
{
    my($page, $links) = @_;
    my @links;

    for my $link (@$links)
    {
        $link->scheme eq 'http' or next;
        my $on_site = $link->authority eq $Authority;
        $on_site or $Options{o} or next;

        $Links++;
        Twiddle();
        print "LINK $link\n" if $Options{v} > 2;
        Link->new($link)->check or do
        {
            Report($page, $link);
            next;
        };

        $on_site or next;
        $link->fragment(undef);
        push @links, $link;
    }

    $Options{r} or return;

    for my $link (@links)
    {
        CheckPage($link);
    }
}

The first loop checks the links. We only check HTTP links, and we only check offsite links if the -o flag is specified. The actual check is

Link->new($link)->check

If the check fails, we call Report().

If the check succeeds and the link is onsite, we add it to @links. If the -r flag is specified, we fall through to the second loop and call CheckPage() on each onsite link.
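
To see why %Checked matters, here is a sketch of the recursion run against a small in-memory site instead of the network. The %Site hash and the @Order list are hypothetical, and the sketch is simplified: it recurses as soon as it sees a link, whereas CheckLinks() first checks every link on the page and only then recurses.

```perl
use strict;
use warnings;

# Hypothetical site: page => onsite links on that page.
# a -> b -> c -> a forms a cycle.
my %Site = (
    'a.html' => ['b.html', 'c.html'],
    'b.html' => ['c.html'],
    'c.html' => ['a.html'],
);

my %Checked;
my @Order;

sub CheckPage
{
    my $page = shift;

    $Checked{$page} and return;     # already visited: this breaks the cycle
    $Checked{$page} = 1;
    push @Order, $page;

    CheckPage($_) for @{$Site{$page}};
}

CheckPage('a.html');
print "visited: @Order\n";          # visited: a.html b.html c.html
```

Without the first two lines of CheckPage, the link from c.html back to a.html would send the program around the cycle forever.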

Output

Report() prints broken links, according to the -a and -v flags.

Twiddle() advances a spinner or prints a progress report, according to the -t flag.

Summary() prints a final count of checked pages, checked links, and broken links.
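
The article does not list these subroutines. A minimal sketch of Twiddle() and Summary(), assuming the globals from the main program and the output formats shown in the Features section, might look like this. The helper names TwiddleText and SummaryText are hypothetical; they build the strings separately from printing them so that the logic is easy to test, and the counts are sample values.

```perl
use strict;
use warnings;

my %Options = (t => 2);                         # as if run with -t 2
my($Pages, $Links, $Broken) = (144, 1025, 3);   # sample counts
my @Spin = ('|', '/', '-', '\\');
my $N    = 0;

# Build the twiddle text for the current -t setting
sub TwiddleText
{
    $Options{t} == 1 and return $Spin[$N++ % @Spin] . "\r";
    $Options{t} == 2 and return "$Pages pages, $Links links, $Broken broken\r";
    ''
}

# The twiddle goes to stderr, so redirecting stdout doesn't hide it
sub Twiddle { print STDERR TwiddleText() }

# Build the final report
sub SummaryText
{
    "Checked $Pages pages, $Links links\n" .
    "Found $Broken broken links\n"
}

sub Summary { print SummaryText() }

Twiddle();
Summary();
```

The trailing "\r" returns the cursor to the start of the line without advancing, so each twiddle overwrites the previous one.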

Distribution

linkcheck 0.01
the version described in this article
linkcheck 1.06
the latest version, with support for

Conclusion

We've seen how to use existing modules to manage URIs, download web pages, and parse HTML. We've written our own packages to cache web pages and links. Building on this infrastructure, we've bolted together a non-trivial application with little more than 100 lines of code.

The power of packages like Page and Link isn't that they do anything very complex or sophisticated; rather, it is that once we have written them, we can use them without having to think about how they work.

Early versions of linkcheck didn't have the Page and Link packages. Instead, they cached pages and links in open code in the main program. The resulting program was intricate, fragile, and difficult to modify.


NOTES

modules
Remember modules? It's a column about modules.
file-scoped lexicals
A correspondent points out that block-scoped lexicals would also solve this problem.
package Spinner;
{
    my $N;
    my @Spin = ('|', '/', '-', '\\');

    sub Spin
    {
        print STDERR $Spin[$N++], "\r";
        $N==4 and $N=0;
    }
}
multiply connected
Otherwise, we would call it the World Wide Tree (WWT)

Steven W. McDougall / swmcd@world.std.com / resume / 2000 Oct 12