This article starts with a brief review of the mathematical definition of a set. Then it describes 5 different implementations of sets in Perl, with attention to

- universe
- implementation
- performance
- special features

a well defined collection of distinct objectsThis definition has three important elements:

- well defined
- you can decide whether or not something is in the set
- distinct
- no duplications
- nothing else
- this means that if two sets have the same elements, then they are the same set, because there isn't anything else that could distinguish them.

In Perl, we have a pre-existing world, already populated with things: numbers, strings, lists, hashes, references. So the first thing we have to decide is what can be an element of a set. Possibilities include

- numbers
- real
- integer

- strings
- references
- anything

- you can find out whether a key is in a hash
- keys are guaranteed unique

%set = map { $_ => 1 } @listHash keys are strings, so the universe for this implementation is the collection of all strings. Perl will happily convert back and forth between strings and numbers, so we can consider the universe to include numbers, as well.

Testing for membership is trivial

$set{'foo'} and ...and taking the union of two sets is easy enough

%u = map { $_ => 1 } keys %a, keys %b;But taking an intersection is a bit more involved

%i = map { $_ => 1 } grep { $b{$_} } keys %a;and constructing the complement of a set is clearly problematic.

If all we need to do is put some elements in a set and extract them later, then using a hash in open code is fine. If we want to do anything more complicated, we will probably want to write some subroutines to implement set operations. And if we're going to write subroutines, we should probably put them in a package, and if we're going to create a package, we might as well make it into a module and put it on CPAN...

`Set::Scalar`

`Set::Scalar`

, and it is available on CPAN.
`Set::Scalar`

provides an object-oriented interface: each set is a represented by a `Set::Scalar`

object. To create a new set, write
$set = new Set::Scalar(qw(a b c d));Internally, the elements of the set are stored as hash keys, so the universe is the collection of all strings. There are methods for set operations and predicates

$u = union $a $b $i = intersection $a $b subset $a $b and ... is_null $a and ...and there are operator overloads for many of the methods

$u = $a + $b # union $u = $a * $b # intersection $a < $b and .... # proper subset

`Set::Scalar`

uses an instance variable to indicate whether the keys in the hash are the strings that are in the set, or the strings that are `Set::Scalar`

has the same performance characteristics as a hash in open code. The memory required for a set is proportional to the number of elements in the set. Insertion, deletion, and testing for an individual element all require unit time. The time required for most other set operations is linear in the number of elements.
`Set::Scalar`

provides an `as_string()`

method, which returns a print representation of a set. The default format is
(a b c d)The format can be controlled by setting display attributes.

`as_string()`

is overloaded onto the stringize (`""`

) operator.
`Set::Scalar`

says that one set can contain a reference to another. However, the current implementation stringizes objects as they are entered into the hash. This means that the first set merely contains the print representation of the second, rather than a live reference to it.
`Set::Scalar`

object is constructed from a hash reference, then it records both the keys and values.
my $transcendentals = new Set::Scalar { e => 2.71828 pi => 3.14159 };Methods are provided for accessing keys and values.

`Set::Object`

`Set::Object`

implements a set of Perl objects. In Perl, objects are accessed through references. The obvious implementation of a set of objects would therefore be a hash of object references. However, this won't work in Perl, because hash keys are strings. If you provide a reference as a hash key, Perl accepts it, but then uses its print representation instead. For example,
$ref = bless {}, 'foo'; %set = ($ref => 1); print keys %set, "\n";prints something like

foo=HASH(0x10015270)As the Perl documentation disarmingly explains, it is not possible to convert the print representation back into a live reference, because the reference count has been lost.

`Set::Object`

implements it. It creates its own hash table. It implements the hash table using Perl arrays, and derives the hash key from the machine address of the reference. It carefully increments the reference count every time it inserts an object into its hash table, and decrements the reference count every time it removes one. Needless to say, all of this is done in XS code.
To the user, `Set::Object`

presents a standard object-oriented interface. Each set is represented as a `Set::Object`

object

$f = new Foo; $b = new Bar; $a = new Set::Object($f, $b);

`Set::Object`

also provides the usual set operations and operator overloads
$u = union $a $b; $i = $a * $b;

`Set::Object`

is implemented as a hash, it has the same performance characteristics as a native hash: single element operations run in unit time, set operations run in linear time, and storage is proportional to the number of elements.
The `Set::Object`

POD shows some benchmarks suggesting that
element lookup time for `Set::Object`

is comparable to that
of a native hash, while element insertion is an order of
magnitude faster. However, the real significance of
`Set::Object`

isn't its performance relative to a native
hash, but the fact that it makes it possible to hash live references
in the first place.

`Set::Object`

in native Perl by using a hash and storing each reference as both key and value:
sub Set::Object::new { my($class, @refs) = @_; my $hash = { map { $_ => $_ } @refs }; bless $hash, $class }Then the stringized reference serves as a unique hash key, while the hash value preserves a copy of the live reference.

`Set::IntRange`

`Set::IntRange`

implements a set of integers.
`Set::IntRange`

allocates one bit for each integer. If the bit is set then the integer is in the set; if the bit is clear then the integer is not in the set.
Obviously, `Set::IntRange`

can not allocate one bit for *every* integer. Actually, it allocates one bit for every integer in a finite range, which is specified when the set is created:

$set = Set::IntRange->new($lower,$upper);Like the other

`Set::`

modules, `Set::IntRange`

provides set operations and operator overloads. One restriction on these operations is that all the operands must have the same range. `Set::IntRange`

contains methods for reallocating a set with a different range, but it is up to the programmer to look after this; `Set::IntRange`

won't do it automatically.
`Set::IntRange`

is a front-end for `Bit::Vector`

. `Bit::Vector`

in turn is a Perl interface for a package of C subroutines that allocate and manage arrays of bits. The whole thing is built for speed. Typical applications include infinite-precision arithmetic and graph algorithms.
Each integer serves as its own index into the bit array, so single element operations run in unit time.

Asymptotically, the storage required for a set and the time required
for set operations both increase linearly with the size of the
set. However, at any point in time, a `Set::IntRange`

object has some range, and given that range, space and time requirements are
fixed. This may be an advantage or a disadvantage, depending on the
application.

In any case, for a given set, `Set::IntRange`

is likely to have much better performance than a hash implementation, because of the array representation and because of the underlying C code.

[lower, upper]where

`lower`

is the first integer in the interval and `upper`

is the last. This is a mathematical notation, and doesn't have anything to do with Perl arrays or references.
`Set::IntRange`

provides methods for operating on intervals. For example,

$set->Interval_Fill($lower, $upper)adds the interval

`[$lower, $upper]`

to `$set`

.
`Set::IntRange`

has two iterators: `Interval_Scan_inc`

and `Interval_Scan_dec`

. These work in terms of intervals. To iterate over the elements of a set, write a loop like this
for ($start = $set->Min(); ($lower, $upper) = $set->Interval_Scan_inc($start); $start = $upper + 1) { ... }This loop executes once, not for each element in the set, but for each interval. In each iteration, the current interval is reported as

`[$lower, $upper]`

.
If there is only one integer in the interval,
then `$upper`

and `$lower`

are the same.
`Bit::Vector`

`BitVector`

method returns a reference to the underlying `Bit::Vector`

object. The `Bit::Vector`

module provides methods for operating on the bit array in many different and powerful ways.
`Set::IntSpan`

`Set::IntSpan`

implements a set of integers. It was designed specifically to manage the article lists in `.newsrc`

files.
Article lists use a special format that compactly represents long runs of consecutive numbers:
alt.foo: 1-21,28,31 alt.bar: 1-14192,14194,14196-14221

`Set::IntSpan`

uses the .newsrc format internally. Rather than record each { 1, 2, ..., 10 }is represented by a list not of 10 numbers, but just 2: 1 and 10. The advantage of this becomes clear when we consider the set

{ 1, 2, ..., 1_000_000_000 }This set contains a billion elements, but it is still represented internally by a list of 2 numbers.

`Set::IntSpan`

presents a standard object-oriented interface, and provides the usual set operations.

$a = new Set::IntSpan "1-21,28,31"; $b = new Set::IntSpan [1..10]; $u = union $a $b; $i = intersect $a $b;

`Set::IntSpan`

has rather different performance characteristics from the other `Set::`

modules. Storage and time requirements are all proportional to the number of intervals in the set.
For sets with a few large intervals, this can be a major performance advantage. For sets with many small intervals, it can be a major disadvantage. This example from the POD illustrates the issue:

$billion = new Set::IntSpan '0-1_000_000_000'; # OK $odd = grep_set { $_ & 1 } $billion; # trouble $even = map_set { $_ * 2 } $billion; # double trouble

`Set::IntSpan`

requires that the boundaries of each interval be finite. However, the first interval may have no lower bound, and the last may have no upper bound. This allows it to represent sets that extend to minus and plus infinity, respectively.

The utility of this isn't that these kinds of infinite sets are especially common or useful. Rather, it is that users can carry out set operations involving complements without getting tangled up in endpoint problems at `INT_MAX`

.

`grep_set`

and `map_set`

operators shown above, `Set::IntSpan`

has explicit iterators. For example, this loop
for ( $element=$set->first; defined $element; $element=$set->next) { ... }will iterate over all the elements of

`$set`

, in order. In addition,
$set->Start($n)sets the iterator to

`$n`

; this allows a loop to iterate over portions of `$set`

, even if `$set`

is infinite.
{ n | n > 2 } the set of integers greater than 2 { n | n%5==0 } the set of integers that are multiples of 5Representing a set like this seems simple enough. Each predicate is represented by a subroutine that returns true iff the predicate is true on its argument:

sub g2 { $_[0] > 2 } sub m5 { $_[0]%5 == 0 }The set object just needs a reference to the subroutine, and then it can decide whether something is an element:

sub Set::Predicate::new #create a new object { my($class, $predicate) = @_; bless { predicate => $predicate }, $class } sub Set::Predicate::element # return true iff $element is in $set { my($set, $element) = @_; $set->{predicate}->($element) }Then we can write

$a = new Set::Predicate \&g2; $b = new Set::Predicate \&m5; $u = union $a $b;Implementing

`union`

(and the rest of the set ops) for this representation is left as an exercise for the reader.
- native
- as it is, for example, in Pascal
- defined
- Real mathematicians don't define sets, they axiomatize them.
- more complicated
- or less, depending on your point of view
- collection
- For technical reasons, the universe may not actually be a set, so
we refer to it as a
*collection*, instead. - numbers
- Of course, we can not then distinguish between a number and the string that represents it in the hash, but this is rarely necessary or desirable, at least in Perl.
- current implementation
- According to the author, this is fixed in
`Set::Scalar 0.9`

. - die!
- Well, no. XS has a place. I just hate to see anyone coding XS unless it's absolutely necessary.
- linearly
- Storage requirements are linear. Time requirements are actually O(n*log(n)): n for the number of elements in the set, and log(n) for the propagation delay through the address decoders between CPU and RAM. If you regard memory access time as a fixed quantity, they you aren't really headed for infinity. Remember, asymptotes don't stop just because you've maxxed out your processor address space.

Steven W. McDougall / resume / swmcd@world.std.com / 1999 November 15