Writing usenet applications with News::* modules

Usenet

Usenet is a computerized bulletin board that runs over the Internet. Users post articles to various newsgroups; others can read the articles, and may reply by posting their own.

Usenet is a distributed system. There is no single location where articles are stored. Rather, articles are stored on many different computers, called servers. The servers are all connected to the Internet, and they constantly exchange articles with each other.

Similarly, there is no single computer through which users submit or retrieve articles. Users can submit and retrieve articles through any computer that is connected to a server; these computers are called clients.

An article submitted from a particular client to a particular server will usually propagate to the bulk of the servers on the Internet within 24 hours. As it reaches each server, it becomes available to all the users who have clients that connect to that server. Servers typically retain articles for a few days or weeks, and then discard them.

Clients and servers send articles to each other using the Network News Transfer Protocol (NNTP).

News software

Needless to say, all of this is managed by software. We can identify four major components that are required to make usenet work.
database
Every server needs a database to store articles. Many servers store each article in a single file, and organize these files into a directory tree. This effectively uses the file system as a database.
NNTP server
Every server needs software to run the server side of NNTP.
NNTP client
Every client needs software to run the client side of NNTP.
newsreader
The user needs application software that provides an interface to usenet.

All this software exists so that users can read usenet without having to understand or worry about the mechanics of running a large distributed database.

The user experience

Traditionally, the user's experience of usenet was mediated by the newsreader, and looked something like this.

Details vary between newsreaders. Some are text-based, others have GUIs; some sort articles by date, others by thread; some are integrated with text editors, or web browsers. However, the underlying user model is largely the same, and has been since the inception of usenet.

Overload

This model is no longer sufficient, for several reasons.

Like the rest of the Internet, usenet has undergone explosive growth in the last few years. There are now over 10,000 newsgroups, carrying among them millions of articles. A single newsgroup may have thousands of available articles, and receive hundreds of new ones each day.

At the same time, much of the traffic on usenet is mislabeled, off-topic, inappropriate, or repetitive. Some of this is due to deliberate abuse, such as trolling or spam; some is inherent in the nature of usenet; and some is simply because, with an exponentially growing user base, most users are inexperienced.

In addition, user requirements have become more complex. Some newsgroups carry ordinary discussions; some are moderated; some carry binary files that require special encoding. Users may archive newsgroups, or gateway them to mailing lists, or collect statistics on the traffic.

Newsreaders still work, but they don't provide all the functions that users want. And the functions that they do provide may no longer be useful. For example, most newsreaders will list available newsgroups, but a list of 10,000 newsgroups may be unmanageable.

If newsreaders don't meet your needs, you may consider writing your own applications to manage usenet.

Roll your own

Usenet is simple enough that writing your own application software is a tenable proposition. It is not, for example, like writing your own compiler because the existing ones don't suit you.

On the other hand, it isn't trivial. Writing a usenet application will potentially involve you in the details of NNTP (see RFC 977), and the format of news articles (see RFC 1036). It may require you to navigate the article database on a news server, or make network connections to one. You may have to read and write .newsrc files.

Fortunately, you don't have to do all this yourself. Much of the infrastructure necessary to write a usenet application has been packaged in modules and made available on CPAN.

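This article surveys eight modules. Six of them encapsulate basic functionality needed to write a usenet application: File::Find, News::NNTPClient, Net::NNTP, News::Article, News::Newsrc, and Set::IntSpan.

Two others provide more specialized functions: News::Gateway and News::Scan.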

File::Find

The first step in any usenet application is generally to get access to the article database. If your machine happens to be a news server, then the database may be accessible to you on the local file system, for example, in a directory tree rooted at /var/spool/news/.

The File::Find module is useful for navigating directory trees. See Finding your files with File::Find for details and examples.
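
As a rough illustration, the sketch below walks a spool directory and lists the files that look like articles. It assumes the traditional layout, in which each newsgroup is a directory path and each article is a file named by its article number; find_articles is a name made up for this example, and your server may organize its spool differently.

```perl
use strict;
use warnings;
use File::Find;

# Collect the article files under a spool directory.
# Assumption: articles are plain files whose names are article numbers,
# e.g. comp/lang/perl/modules/25450 under the spool root.
sub find_articles {
    my ($spool) = @_;
    my @articles;
    find(
        sub {
            # Inside the callback, $_ is the file name and
            # $File::Find::name is the full path.
            push @articles, $File::Find::name if -f $_ && /^\d+$/;
        },
        $spool
    );
    return sort @articles;
}

# Pass the spool directory on the command line, e.g. /var/spool/news
if (@ARGV) {
    print "$_\n" for find_articles($ARGV[0]);
}
```

Files that are not named by number, such as index or overview files, are skipped by the name test.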

News::NNTPClient and Net::NNTP

If you don't have local access to an article database, you will need to connect to a server using NNTP. There are two modules that will do this for you: News::NNTPClient and Net::NNTP. News::NNTPClient is a free-standing module, while Net::NNTP is part of the larger libnet package.

To retrieve articles using News::NNTPClient, you do something like

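use News::NNTPClient;

$server = "news.isp.com";
$client = News::NNTPClient->new($server);
$group  = "comp.lang.perl.modules";

# group() selects the newsgroup and returns its first and last article numbers
($first, $last) = $client->group($group);

for ($n = $first; $n <= $last; $n++)
{
    # article() returns the article as a list of lines
    @lines = $client->article($n);
}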

To post an article, do

@header = ("Newsgroups: test", 
	   "Subject: test", 
	   "From: tester");
@body   = ("This is the body of the article");

$client->post(@header, "", @body);

The interface to Net::NNTP is similar:

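use Net::NNTP;

$client             = Net::NNTP->new($server);
($n, $first, $last) = $client->group($group);

print "$group contains $n articles\n";

# article() returns a reference to an array of lines
$lines = $client->article($first);

# Net::NNTP sends the text as given, so each line needs its own newline
$client->post(map { "$_\n" } (@header, "", @body));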

News::Article

News articles have a simple format. There are some headers, like
Newsgroups: test
Subject: test
From: tester

and a body, which can contain arbitrary ASCII text:

This is the body of the article

The body is separated from the headers by a single blank line.
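
To make the format concrete, here is a minimal hand-rolled parser that splits a list of lines into headers and body at the first blank line. It is a sketch only: parse_article is a name invented for this example, and the handling of continuation lines and repeated headers is deliberately simplistic.

```perl
use strict;
use warnings;

# Split an article (a list of lines) into a header hash and a body list.
# Simplifications: header names are lowercased, continuation lines (which
# begin with whitespace) are folded into the previous header, and a
# repeated header keeps only its last value.
sub parse_article {
    my @lines = @_;
    chomp @lines;

    my (%header, $last);
    while (defined(my $line = shift @lines)) {
        last if $line eq "";                    # blank line ends the headers
        if ($line =~ /^\s+(.*)/ && defined $last) {
            $header{$last} .= " $1";            # continuation line
        }
        elsif ($line =~ /^([^:\s]+):\s*(.*)/) {
            $last = lc $1;
            $header{$last} = $2;
        }
    }
    return (\%header, \@lines);                 # remaining lines are the body
}

my @article = (
    "Newsgroups: test",
    "Subject: test",
    "From: tester",
    "",
    "This is the body of the article",
);
my ($header, $body) = parse_article(@article);
print "Subject: $header->{subject}\n";
print "Body has ", scalar(@$body), " line(s)\n";
```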

The News::NNTPClient and Net::NNTP article methods return articles as an array (or a reference to an array) of lines. You could go groveling through the article, parsing headers and locating the body; it wouldn't even be that hard: Perl is excellent at this sort of text processing. But you don't have to. Instead, you can use News::Article.

News::Article takes the list of lines that constitute an article and creates an object to manage that article. It provides methods for getting and setting headers and the body. It can also post the article back through a Net::NNTP object.

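use News::Article;

# $lines is the array reference returned by Net::NNTP's article method
$article    = News::Article->new($lines);
@newsgroups = $article->header("Newsgroups");
$subject    = $article->header("Subject");
$body       = $article->body;
@quoted     = map { "> $_" } @$body;

# @incisive_commentary holds the lines of your reply.
# The address is single-quoted so that Perl does not interpolate @isp.
$followup   = News::Article->new;
$followup->set_headers(From       => 'clueful@isp.com',
                       Newsgroups => [ @newsgroups ],
                       Subject    => $subject);
$followup->set_body(@quoted, @incisive_commentary);
$followup->post($client);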

News::Newsrc

To help applications keep track of articles, servers assign each article an article number. There is a separate series of article numbers for each newsgroup. Article numbers begin at 1 when the newsgroup is created on the server, and increment indefinitely. Over time, article numbers reach into the thousands; on heavily-trafficked newsgroups, the millions.

Many usenet applications keep lists of articles that have been read or otherwise processed. Listing millions of article numbers would be infeasible; instead, they use a compressed format, like this

1-1013,1015,1020-1030
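
The compressed format is easy to produce and to query by hand. The following sketch, in plain Perl with no modules, builds such a list from article numbers and tests membership without expanding the ranges; compress_list and list_contains are names invented for this illustration.

```perl
use strict;
use warnings;

# Compress a list of article numbers into run-list form, e.g. "1-3,5".
sub compress_list {
    my @n = sort { $a <=> $b } @_;
    my @runs;
    while (@n) {
        my $lo = my $hi = shift @n;
        # extend the run while the next number is a duplicate or adjacent
        $hi = shift @n while @n && $n[0] <= $hi + 1;
        push @runs, $lo == $hi ? $lo : "$lo-$hi";
    }
    return join ",", @runs;
}

# Test whether an article number appears in a run list, without expanding it.
sub list_contains {
    my ($list, $number) = @_;
    for my $run (split /,/, $list) {
        my ($lo, $hi) = split /-/, $run;
        $hi = $lo unless defined $hi;           # a single number is a run of one
        return 1 if $lo <= $number && $number <= $hi;
    }
    return 0;
}

print compress_list(1 .. 5, 7, 10 .. 12), "\n";      # prints 1-5,7,10-12

my $list = "1-1013,1015,1020-1030";
print "1014 is ", (list_contains($list, 1014) ? "" : "not "), "in the list\n";
```

Querying the ranges directly matters on busy newsgroups, where expanding a list into individual numbers could mean millions of elements.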

Each newsgroup has its own article list. Article lists are typically stored in a .newsrc file:

comp.lang.perl.announce:  1-1186
comp.lang.perl.misc:      1-233883,234000-234018
comp.lang.perl.moderated: 1-3406,3478
comp.lang.perl.modules:   1-25308,25450,25452,25494
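
For a sense of what parsing involves, here is a minimal sketch that reads lines like those above into a hash. It relies on the usual .newsrc convention that a colon after the group name marks a subscribed group and an exclamation mark an unsubscribed one; parse_newsrc is a name invented for this example, and lines that do not fit the pattern are silently skipped.

```perl
use strict;
use warnings;

# Parse the text of a .newsrc file into a hash keyed by newsgroup name.
# Each value records whether the group is subscribed and keeps the
# article list as an unparsed string.
sub parse_newsrc {
    my ($text) = @_;
    my %groups;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^([^\s:!]+)([:!])\s*(.*?)\s*$/;
        $groups{$1} = { subscribed => ($2 eq ":" ? 1 : 0), list => $3 };
    }
    return \%groups;
}

my $text = <<'END';
comp.lang.perl.announce:  1-1186
comp.lang.perl.misc:      1-233883,234000-234018
comp.lang.perl.moderated: 1-3406,3478
comp.lang.perl.modules:   1-25308,25450,25452,25494
END

my $groups = parse_newsrc($text);
for my $name (sort keys %$groups) {
    print "$name => $groups->{$name}{list}\n";
}
```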

Parsing a .newsrc file isn't too difficult, and you can use Set::IntSpan to manipulate the article lists. But, again, you don't have to. News::Newsrc will take care of the whole thing for you.

use News::Newsrc;

$newsrc = News::Newsrc->new;
$newsrc->load("$ENV{HOME}/.newsrc");

$group  = "comp.lang.perl.modules";
$number = 42;
if (not $newsrc->marked($group, $number))
{
    # process the article
    $newsrc->mark($group, $number);
}

$newsrc->save;

News::Gateway

News::Gateway provides infrastructure and architecture for a common usenet application: news/mail gateways.

Email messages and usenet articles are very similar, both in structure and function. They have some headers and a body, and they are transported over the network from a sender to a receiver. News/mail gateways allow articles that originate on usenet to be read as email, and allow email messages to be posted to usenet. This is useful in several contexts.

News::Gateway defines a 3-layer architecture for gateways.

  1. infrastructure
  2. implementation
  3. policy

News::Gateway provides the infrastructure, and it defines a framework for collecting and organizing implementations, which may be provided by third parties. Policy is implemented separately by each application. The goals are to handle the details common to all gateway applications, and reduce the amount of code that must be written in each application.

As always, programmers should consider using existing modules in order to reduce the amount of code that they have to write. However, there is a special reason to use News::Gateway. Programs that handle mail and news can have subtle bugs. They may make assumptions that are valid for most users, systems, and networks, and then fail in rare instances where those assumptions don't hold. In the worst case, they can create infinite mail loops and flood servers. These problems can be intermittent and difficult to reproduce; they are typically detected by users separated by time and distance from the original programmers, and they can be very difficult to track down.

If you are writing any kind of mail/news gateway, consider using News::Gateway.

News::Scan

News::Scan reads articles from a newsgroup and computes statistics about the traffic in the group. These include the total number of

It also collects information about

News::Scan is not a general purpose module. It was written for one single purpose: collecting traffic statistics. Typically, these are collected at intervals and then posted on the newsgroup, so that people who read that newsgroup can have some idea what kind of traffic it is carrying.

With just a little more code (and a little less documentation), News::Scan could have been made into an application. But then it would run from the command line; it would scan newsgroups in just one way; it would provide output in just one format. Anyone who wanted to do anything different would be faced with either writing their own program from scratch or trying to hack News::Scan to meet their needs.

Because News::Scan is a module, programmers can easily embed it in larger programs, they can take input from whatever sources they have, and they can generate output in whatever format they need. If you want to collect traffic statistics, News::Scan is the module for you.
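
To give a feel for the kind of tally a traffic report is built from, here is a minimal sketch in plain Perl that counts articles per poster. It does not use News::Scan and says nothing about its interface; count_posters and the sample addresses are invented for this illustration.

```perl
use strict;
use warnings;

# Count articles per poster. Each article is a reference to an array of
# lines; only the From: header is examined.
sub count_posters {
    my @articles = @_;
    my %count;
    for my $lines (@articles) {
        for my $line (@$lines) {
            last if $line =~ /^\r?\n?$/;        # blank line: end of headers
            if ($line =~ /^From:\s*(.*?)\s*$/i) {
                $count{$1}++;
                last;
            }
        }
    }
    return \%count;
}

my @articles = (
    [ 'From: alice@example.com', 'Subject: one',   '', 'hello'       ],
    [ 'From: bob@example.com',   'Subject: two',   '', 'hi'          ],
    [ 'Subject: three', 'From: alice@example.com', '', 'hello again' ],
);

my $count = count_posters(@articles);

# Most prolific posters first; ties are broken alphabetically.
for my $poster (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    printf "%3d %s\n", $count->{$poster}, $poster;
}
```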


Notes

server
In fact, servers have to run both sides of NNTP.
most
If you accept the estimate that the Internet doubles each year, then at any point in time, half the users have less than 1 year of experience.
traffic
Traffic statistics are rather like weather reports.

Steven W. McDougall / swmcd@theworld.com / 1999 August