Hacking the Mother Code

Robert Robbins, NSF program director, Database Activities, and DOE acting director, Genome Informatics Activities, describes the problems of genomics in computer terms.
Imagine that you are given a tape containing 3.5 gigabytes of binary files for a large set of application programs and for some future release of Unix (version 20,000,000, say). The tape is not a file-by-file copy, but rather a streaming dump of the bytes in the order they occupied on disk. You know many of the files on the disk are fragmented and that some of the 3.5 gigabytes represent erased files or other garbage from unused portions of the disk. Your assignment? Discard the garbage, then reassemble the fragmented executables and reverse compile them back to the source code. Finally, reverse engineer the source code into design specs.

If that's not challenging enough, bear in mind that you have only partial and possibly incorrect documentation for the hardware on which the codes run. You don't know all the details of the CPU, the registers it has, or the OP codes it uses. Also, remember that your 3.5 gigabytes of programs are the result of 20,000,000 maintenance revisions performed by the worst possible set of kludge-using, spaghetti-coding hackers who delight in clever tricks like writing self-modifying code and relying upon undocumented system quirks.

...Remember that 3.5 gigabyte tape that had to be reverse engineered? Well, you don't really have that tape—getting it is your first problem. The technology doesn't exist to read the entire genome (the biological "disk" to tape. But it is possible to obtain small fragments from random locations on the disk. If you have enough random fragments, you can compare them with each other, detect overlaps, and begin assembling larger and larger portions of the whole tape. To finish the job, many random fragments will be required. If each fragment is about 1/50,000th the size of the disk, then about 250,000 are needed in order to have a 99% chance that every byte on the disk is contained in at least one fragment.

To complicate things, errors occur so that some of the fragments are really the concatenation of two or more pieces from different locations on the disk. Also, identical substrings often occur at different locations on the disk, leading to false overlap detections. Finally, it would cost too much to "read" the byte sequence of every one of these fragments directly, so first you must determine the minimum spanning set of fragments. This requires you to devise and perform some partial characterization of each fragment that can be used in an exhaustive set of pairwise computations to generate a 60-billion entry probability-of-overlap matrix from which you can attempt to deduce the minimal spanning set. All of your characterization, comparison, and assembly algorithms must take into account the possible occurrence of random or systematic errors at every point.

Karen A. Frenkel, "The Human Genome Project and Informatics", Communications of the ACM, Vol. 34, No. 11, 1991, pp 41-51.

Copyright ACM. Used by permission.


Notes

compare them with each other, detect overlaps
this is the sequence alignment problem

Steven W. McDougall / resume / swmcd@world.std.com / 1999 October 21