Do not meddle in the affairs of wizards, for you are crunchy and good with ketchup.

XS Mechanics

This article is about XS. It explains what it is, why it is, how it works, and how to use it. It includes a complete, working example of an XS module, and a stub module that you can use as a starting point for your own code. It is an express goal of this article to provide the background and information necessary for you to write your own XS modules.

This article is in five parts
November Introduction motivation, definitions, examples
December Architecture the Perl interpreter, calling conventions, data representation
January Tools h2xs, xsubpp, DynaLoader
February Modules Math::Ackermann, Set::Bit
March Align::NW Needleman-Wunsch global optimal sequence alignment

Modules

In the first part of this article, we presented a problem that could benefit from an XS implementation. Two months ago, we described the architecture of XS. Last month, we wrote XS code. This month, we write XS modules.

First, we discuss some design alternatives for XS modules. Then we present examples that illustrate different designs.

Design

Broadly speaking, an XS module is any Perl module that includes XS code.

Perl does not have classes or objects per se. Rather, it has a collection of features that conspire to support an object-oriented (OO) programming model. These include

subroutine
a method in a class
package
a namespace for a class
module
a file to hold the code for a class
reference
a handle for an object
use
loads and initializes a class
bless
associates an object with a package
->
invokes methods on an object

XS is just one more feature in the mix, and there are different ways we can use it. In particular, XS modules can represent data in either C or Perl, and can have methods written in either C or Perl.

If we represent object data in C, then methods written in C—that is, xsubs—have native access to the data. Methods written in Perl must access object data through accessor methods.

If we represent object data in Perl, then methods written in Perl have native access to the data. Methods written in C must access object data through the Perl C API.

This table summarizes the four possibilities for data representation and access.
Data
Perl C
Method Perl native accessor method
xsub Perl C API native

Representing data in Perl may be simpler; representing data in C may be necessary for performance. We give an example of each approach in the remainder of this article.

Math::Ackermann

Ackermann's function is defined recursively as
A(0  , n  ) = n + 1 
A(m+1, 0  ) = A(m, 1) 
A(m+1, n+1) = A(m, A(m+1, n))

It is perhaps unparalleled in consuming enormous amounts of RAM and CPU with very little code, while still doing something of at least nominal utility.

Because this is such a CPU-intensive function, we would like to write a module that does two things

Caching values is easy in Perl.

package Math::Ackermann;

sub new
{
    my $class = shift;
    bless [], $class
}

sub compute
{
    my($ackermann, $m, $n) = @_;
    
    defined $ackermann->[$m][$n] or
    	$ackermann->[$m][$n] = A($m, $n);
    
    $ackermann->[$m][$n]
}	

Computing the value in C is easy and efficient.

int A(int m, int n)
{
    if (m==0) return n+1;
    if (n==0) return A(m-1, 1);
    return A(m-1, A(m, n-1));
}
Here is the XS code to connect the C subroutine to the Perl module
int
A(m, n)
        int m
        int n

Now we can compute values like this

my $ackermann = new Math::Ackermann;

print $ackermann->compute(1, 1);  # computes value
print $ackermann->compute(2, 2);  # computes value
print $ackermann->compute(1, 1),  # returns cached value

In this design, data is represented in Perl, and computation is done in C. This is a good allocation of resources, because data access time is small compared to computation time. Also, the volume of data is very small: 2 integers on input; one integer on output, so storage efficiency is not an issue.

Because Math::Ackermann represents instance data in Perl, routines that are written in Perl, such as new() and compute(), have native access to that data. Routines that are written in C, such as A(), must access data through the Perl C API. The XS routine shown above describes the calling sequence for A(), and xsubpp generates the necessary calls to the Perl C API.

Next, we'll look at a Perl module that represents data in C.

Set::Bit

Set::Bit represents a set of integers in the range 0..n It allocates one bit of storage for each integer in the range. If the bit is set then the integer is in the set; if the bit is clear then the integer is not in the set.

The computation involved in set operations is trivial: setting and clearing bits to add or remove integers from a set; and'ing and or'ing bits to compute unions and intersections. Because of this, overall performance is dominated by data access time: how long it takes to locate a particular bit in a particular set. In addition, sets may be arbitrarily large, so storage efficiency will inevitably become an issue.

To improve performance, we represent Set::Bit objects as C structs

typedef struct
{
    int  nBits;	/* range of set: 0..nBits-1 */
    int  nInts; /* number of integers       */
    int *pInts; /* vector of integers       */
} Vector;

and we write C subroutines to manipulate these structs

Vector *new      (int nBits);
void    DESTROY  (Vector *pVector);
void    dump     (Vector *pVector);
int     top      (Vector *pVector);
void    insert   (Vector *pVector, int n);
void    Remove   (Vector *pVector, int n);
int     member   (Vector *pVector, int n);
Vector *Union    (Vector *pA, Vector *pB);
Vector *intersect(Vector *pA, Vector *pB);

The name Vector refers to the vector of ints that holds the bits. To create a Set::Bit object, we call new()

Vector *new(int nBits)
{
    Vector *pVector = (Vector *) malloc(sizeof (Vector));
    pVector->nBits = nBits;

    pVector->nInts = (nBits + sizeof(int) - 1) / sizeof(int);
    pVector->pInts = (int *) calloc(pVector->nInts, sizeof(int));

    return pVector;
}

The new() routine

To add an integer to a set, we call insert().

void insert(Vector *pVector, int n)
{
    int q, r;

    q = n / sizeof (int);
    r = n % sizeof (int);

    pVector->pInts[q] |= 1<<r;
}

When we are done with a set, we call DESTROY() to free the storage.

	void DESTROY(Vector *pVector)
	{
	    free(pVector->pInts);
	    free(pVector);
	}

The other subroutines have similar implementations. All the C declarations and definitions are in

A C object

We installed A() as an xsub in the Math::Ackermann package, but the Math::Ackermann object was an ordinary Perl array.

We will install the C subroutines in vector.c as xsubs in the Set::Bit package, but we want to do more than that. We don't want any kind of Perl data structure to be the Set::Bit object. Instead, we want the C-language Vector struct to be the Set::Bit object.

We have not written a module where the object itself is a C struct, and we need to learn some new XS tricks to make it work. If we present the solution all at once, it will likely seem complex and opaque. Instead, we will discover the solution by trial and error.

h2xs

As always, we begin by running h2xs
.../development>h2xs -A -n Set::Bit
.../development>cp vector.c Set/Bit/
.../development>cp vector.h Set/Bit/

We add an OBJECT key to the argument list of WriteMakefile() in Makefile.PL

'OBJECT'    => 'Bit.o vector.o'

We #include vector.h in Bit.xs, and add a PROTOTYPES directive. Bit.xs now looks like this

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include "vector.h"

MODULE = Set::Bit               PACKAGE = Set::Bit

PROTOTYPES: ENABLE

new

Since we don't know what else to do, we try writing an XS routine for new, like this
Vector *
new(nBits)
	int nBits

Now, we run

.../development/Set/Bit>perl Makefile.PL
.../development/Set/Bit>make

and immediately get

Error: 'Vector *' not in typemap in Bit.xs, line 12

The typemap

This isn't hard to understand. make invokes xsubpp, and xsubpp generates code to translate between C and Perl data representations. The default type map in
/usr/local/lib/perl5/version/ExtUtils/typemap

contains entries for translating many built-in C data types. But Vector * isn't a built-in C data type, and the default typemap certainly hasn't an entry for it.

To tell xsubpp how to translate a Vector *, we create a local typemap in the module directory, and add an entry to it for Vector *

In the most general case, we would need to make up a new XS type, and add entries to the TYPEMAP, INPUT, and OUTPUT sections. If we called our XS type T_VECTOR, then our typemap file would look something like

TYPEMAP
Vector *	T_VECTOR

INPUT
T_VECTOR
	# Code to translate a Perl scalar to a Vector *

OUTPUT
T_VECTOR
	# Code to translate a Vector * to a Perl scalar

T_PTROBJ

But we don't have to do all this. The default typemap file already supplies a suitable XS type, called T_PTROBJ, together with code fragments for it in the INPUT and OUTPUT sections. To use T_PTROBJ, all we have to do is create a
.../development/Set/Bit/typemap
file and put this line
Vector *	T_PTROBJ

in it. This tells xsubpp to use the code fragments for T_PTROBJ whenever it needs to translate between a Vector * and a Perl scalar.

T_PTROBJ was created expressly for our purpose: to make C structs into Perl objects. If we now run make, we get a clean compile. But before we try to run our code, let's look at Bit.c and find out what xsubpp has (and hasn't) done for us.

sv_setref_pv

Here is the glue routine emitted by xsubpp to connect new() to Perl
XS(XS_Set__Bit_new)
{
    dXSARGS;
    if (items != 1)
        croak("Usage: Set::Bit::new(nBits)");
    {
        int     nBits = (int)SvIV(ST(0));
        Vector *RETVAL;

        RETVAL = new(nBits);
        ST(0) = sv_newmortal();
        sv_setref_pv(ST(0), "VectorPtr", (void*)RETVAL);
    }
    XSRETURN(1);
}

We saw most of this code last month. The glue routine extracts nBits from the Perl stack, and declares RETVAL as a Vector *. Then it calls new(), and assigns the return value to RETVAL. Next, it creates a new mortal scalar at ST(0) on the Perl stack.

Now comes the magic: the glue routine calls sv_setref_pv to return RETVAL to Perl.

sv_setref_pv(ST(0), "VectorPtr", (void*)RETVAL);

perlguts documents sv_setref_pv like this

SV* sv_setref_pv(SV* rv, char* classname, PV iv);
Copies the pointer value (the address, not the string!) into an SV whose reference is rv. SV is blessed if classname is non-null.

SV stands for Scalar Value; an SV is the internal Perl data structure that represents a scalar. When we call sv_setref_pv(), it

Finally, the mortal at ST(0) is the return value of the new() method.

The package problem

This is very close to what we want. The return value from new() is a reference to a blessed scalar that holds a pointer to our C-language Vector struct. However, the scalar is blessed into the VectorPtr package, not the Set::Bit package.

This is a bit odd. Perl methods that function as constructors customarily return references to objects that are blessed into the method's own package. The

MODULE = Set::Bit               PACKAGE = Set::Bit

line in Bit.xs causes all our xsubs to be installed in the Set::Bit package. But Set::Bit::new() returns a reference to an object that is blessed into the VectorPtr package. This means, for example, that we could write

$set = Set::Bit::new(100);
and get a reference to a valid Perl object, but we couldn't write
$set->insert(42);

because $$set is blessed into the VectorPtr package, and the VectorPtr package has no insert() method. The insert() method is in the Set::Bit package, along with the new() method.

There are various ways around this problem. We could leave new() in the Set::Bit package, and add another PACKAGE directive to install the rest of our xsubs in the VectorPtr package

MODULE = Set::Bit               PACKAGE = VectorPtr

Or we could rename the whole module VectorPtr.

But these solutions are ugly and awkward. Module names shouldn't be dictated by the names of our C structs. Module names should be free parameters in our design, chosen to organize our libraries and to make our code read naturally. We very much want to get our object blessed into the Set::Bit package.

$ntype

As shown above, the code that blesses our object is
sv_setref_pv(ST(0), "VectorPtr", (void*)RETVAL);

To control the package that our objects are blessed into, we need to understand where the string "VectorPtr" comes from. If we look in the OUTPUT section of the default typemap, we find this entry

T_PTROBJ
        sv_setref_pv($arg, \"${ntype}\", (void*)$var);

So xsubpp emits code that blesses the object into the package named by $ntype. As discussed last month, $ntype is the type of the C variable that is being converted. In this case, the C variable is RETVAL, and its type is Vector *.

Vector * is an example of a C type name that is not a valid Perl package name. To accommodate this, xsubpp silently substitutes the string "Ptr" for the * before emitting $ntype as a package name. This is a general mechanism; for example, the C type name Vector * * would be converted to the Perl package name VectorPtrPtr.

RETVAL

RETVAL is the C variable that xsubpp declares to hold the return value of our C subroutine; therefore, its type must match the return type of our C subroutine. xsubpp declares RETVAL with the type that we specify for our XS routine in Bit.xs. Thus, the XS code
Vector *
new(nBits)
	int nBits

in Bit.xs leads to the C-language declaration

Vector *        RETVAL;

in Bit.c.

This means that the return type of our XS routine determines the package that our object is blessed into. In order to bless our object into the Set::Bit package, we must write our XS routine like this

Set::Bit
new(nBits)
	int nBits

It seems that this cannot stand, because

but we can get around both problems.

xsubpp handles the second problem for us. It squashes colons in $ntype to underscores whenever it emits $ntype as the type of a C variable. So the XS code shown above leads to this declaration for RETVAL in Bit.c

Set__Bit        RETVAL;

which is valid C code.

To handle the first problem, we simply add a typedef to Bit.xs, like this

typedef Vector *Set__Bit;

xsubpp passes this typedef through to Bit.c unchanged. Now the declaration of RETVAL is valid and correct.

xsubpp doesn't squash colons when it emits $ntype as a Perl package name, so the call to sv_setref_pv will be

sv_setref_pv(ST(0), "Set::Bit", (void*)RETVAL);

This will bless our object into the Set::Bit package, as we desire.

typemap revisited

After making these changes, Bit.xs looks like this
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include "vector.h"

typedef Vector *Set__Bit;

MODULE = Set::Bit               PACKAGE = Set::Bit

PROTOTYPES: ENABLE

Set::Bit
new(nBits)
        int nBits

If we now run

.../development/Set/Bit>make
we get
Error: 'Set::Bit' not in typemap in Bit.xs, line 14

The problem is that xsubpp is trying to find the return type of our XS routine in the typemap. We made an entry in the local typemap for Vector *, but then we changed the return type of our XS routine from Vector * to Set::Bit. We need to change the entry in the local typemap accordingly

Set::Bit        T_PTROBJ

Method call syntax

With this change, we can run
.../development/Set/Bit>make

and get a clean compile. Now, we try to create a Set::Bit object. We add this line to test.pl

my $set = new Set::Bit 100;

and run

.../development/Set/Bit>make test

The output is

1..1
ok 1
Usage: Set::Bit::new(nBits) at test.pl line 21.

The ok 1 means that the module loaded successfully. The usage message means that we called the new() method with the wrong number of arguments.

With a little thought, we can identify the problem. We invoked new() with method call syntax. In Perl, a method call

new Set::Bit 100

ultimately becomes a subroutine call with the package name passed as the first argument

Set::Bit::new("Set::Bit", 100)

To accommodate this, we need to change our XS routine to take two arguments: a string and an integer

Set::Bit
new(package, nBits)
        char *package
        int   nBits

If we run xsubpp on this XS routine, it will emit C code to call new() with both arguments

RETVAL = new(package, nBits);

We could change our C routine to take both arguments, and then ignore the package name. However, I try to avoid polluting my interfaces with implementation artifacts. Instead, we add a CODE: directive to the XS routine, and write the call to new() ourselves.

CODE:
RETVAL = new(nBits);

When we use the CODE: directive, we also need an OUTPUT: directive to tell xsubpp what the return value is

OUTPUT:
RETVAL

Our XS routine is now

Set::Bit
new(package, nBits)
        char *package
        int   nBits
        CODE:
        RETVAL = new(nBits);
        OUTPUT:
        RETVAL

When we run

.../development/Set/Bit>make test

we get a clean compile, and the output is

1..1
ok 1

showing that our call to new() succeeded.

The Set::Bit Object

Let's step back for a moment and review our situation. The glue routine for new() calls sv_setref_pv(), and sv_setref_pv() creates a Perl scalar. This scalar is the Set::Bit object. Here are three crucial facts regarding it

Here is a picture that illustrates this

	$set = new Set::Bit 100;

	reference 
	Name: $set 
	RV: ----------->scalar
		     	Package: Set::Bit
		     	IV: ----------------->Vector
			 		      {
					      }

RV stands for Reference Value. IV stands for Integer Value.

Because we have a reference to the Set::Bit object, we can make method calls on it. Whenever we make a method call, the reference is passed as the first argument to the method.

Methods may be written in either Perl or C. A method can dereference its first argument to get the value of the object. The value of a Set::Bit object is the machine address of a Vector struct.

There isn't much that a Perl method can do with the address of a Vector struct, except maybe print it.

sub Set::Bit::print_address
{
    my $set = shift;
    printf "%p", $$set;
}

A C method can pass the address of a Vector struct to any of the C-language subroutines defined in vector.c. Let's look at how this is done.

insert

The insert method inserts an integer into a set
$set = new Set::Bit 100;
$set->insert(42);

Here is the XS code to install insert() in the Set::Bit package

void
insert(pVector, n)
        Set::Bit pVector
        int      n

and here is the glue routine that xsubpp emits to call insert()

XS(XS_Set__Bit_insert)
{
    dXSARGS;
    if (items != 2)
        croak("Usage: Set::Bit::insert(pVector, n)");
    {
        Set__Bit        pVector;
        int     n = (int)SvIV(ST(1));

        if (sv_derived_from(ST(0), "Set::Bit")) {
            IV tmp = SvIV((SV*)SvRV(ST(0)));
            pVector = (Set__Bit) tmp;
        }
        else
            croak("pVector is not of type Set::Bit");

        insert(pVector, n);
    }
    XSRETURN_EMPTY;
}

Let's look at the C code to see how we get our Vector * back from Perl.

The XS routine declares pVector to be of type Set::Bit. As before, xsubpp squashes the colons to underscores for us, emitting the C declaration

Set__Bit        pVector;

and the typedef in Bit.xs makes Set__Bit an alias for Vector *.

The first argument to insert() is a reference to a Set::Bit object. This argument appears on the Perl stack at ST(0).

sv_derived_from() verifies that ST(0) is indeed a reference to a Set::Bit object. SvRV() dereferences ST(0), returning the scalar that was created earlier by sv_setref_pv(). SvIV() extracts the value of that scalar as an Integer Value (IV). This value is the Vector * that was stored there by sv_setref_pv().

The glue routine stores the IV in tmp, and then assigns it to pVector on the next line, with a typecast to quiet the compiler. Finally, it calls the C-language insert() routine, passing it the Vector * in pVector.

Instance methods

The XS code for the remaining instance methods is similar. Here is member()
int
member(pVector, n)
        Set::Bit pVector
        int      n

and intersect()

Set::Bit
intersect(pA, pB)
        Set::Bit pA
        Set::Bit pB

In each case, the first argument is a reference to the object on which the method was invoked. intersect() creates and returns a new Set::Bit object; the mechanics of this are the same as those described above for new().

union

The XS code for union() is a bit different. We can't have a C subroutine named union, because union is a keyword in C. I named the C subroutine Union, but I want the Perl method to be named union, so that the Perl interface will have a consistent naming scheme. With just a little more XS code, we can name our method and call it, too.
Set::Bit
union(pA, pB)
        Set::Bit pA
        Set::Bit pB
        CODE:
        RETVAL = Union(pA, pB);
        OUTPUT:
        RETVAL

We name the XS routine union, and then add a CODE: directive so that we can write the C call to Union.

DESTROY

When the last reference to a Set::Bit object goes away, Perl deletes the object. The object itself is a Perl scalar: Perl allocated the memory for it in the call to sv_setref_pv(), and Perl will free that memory.

The Set::Bit object holds a pointer to a Vector struct. Perl doesn't know anything about the Vector struct. We allocated it, and we are responsible for freeing it. We do this in the DESTROY method.

Perl calls the DESTROY method for us after the last reference to the Set::Bit object goes away, and before it deletes the Set::Bit object. Aside from the fact that Perl calls DESTROY() automatically, it is an ordinary method. Here is the XS code for DESTROY().

void
DESTROY(pVector)
        Set::Bit pVector

As with the other instance methods, the glue routine extracts the Vector * from the Perl stack and passes it to our C-language DESTROY() routine. Our C code then frees the storage for the Vector.

Perl methods

We mentioned above that a Perl method can't operate on a Vector *. However, we can use accessor methods (written in C) to get instance data from an object, and then operate on that data in Perl.

member() is an accessor method: it returns true iff its argument is an element of the set. With the member() method, we can write other methods in Perl. For example, the elements() method returns a list of the elements of a set

sub elements
{
    my $set = shift;
    grep { $set->member($_) } 0..$set->top
}

and the print() method returns a string containing a print representation of a set

sub print
{
    my $set = shift;
    join ", ", $set->elements
}

Given a suitable set of accessor methods, the choice of whether to implement other methods in C or in Perl can be made on the basis of performance and convenience.

A Perl object

Earlier, we said that we wanted the Set::Bit object to be the C-language Vector struct, rather than a Perl data object. It didn't work out that way. The Set::Bit object is indeed a Perl data object: it is the scalar created by sv_setref_pv().

Nonetheless, the Set::Bit object gives us the essential features of a C-language object

At the same time, the Set::Bit object gives us the flexibility to write methods in Perl. It is, perhaps, the best of both worlds.

Distribution

Here are sources and distribution for Math::Ackermann

Here are sources and distribution for Set::Bit

Align::NW

The Align::NW module does Needleman-Wunsch global optimal sequence alignment. It represents data in both C and Perl, and has methods written in both C and Perl. We'll discuss the techniques used to implement it—next month.

NOTES

trial and error
That's how I discovered it.
creates
The documentation is not quite explicit on this point.
get around both problems
Mr. E.V. Lambert of Homeleigh, The Burrows, Oswestly, has presented us with a poser. We do not know which bush he is behind, but we can soon find out.
<BOOM>
<BOOM>
<BOOM> <yyaaahhhhh!>
Yes, it was the middle one.
- Monty Python, How Not to be Seen
loaded successfully
which is never to be taken for granted when writing XS modules
usage message
We get usage messages because we enabled prototypes in Bit.xs.
integer value
as opposed to a string value

Creative Commons License
XS Mechanics by Steven W. McDougall is licensed under a Creative Commons Attribution 3.0 Unported License.
Steven W. McDougall / resume / swmcd@world.std.com / 2000 February 10