My Adventure using Computer Science on the Genome Project

Webb Miller

A considerable number of computer scientists have become involved in the Human Genome Project. Their backgrounds before making that career change, the personal motivations for the decision, and their choices of new problems to attack vary tremendously. I can only tell you one of the many stories --- mine.

Around 1987 I decided to drop everything and spend all of my research time on the Genome Project. This represented a radical change, since my background in the area was almost non-existent. I hadn't even taken a biology course in high school or college. My degrees were in mathematics, and most of what I had worked on since graduation was completely unrelated to biology. However, I had a strong interest in, and a little knowledge of, algorithms, particularly algorithms for comparing sequences, and I enjoyed writing programs, both of which have been tremendously useful for my foray into bioinformatics.

A number of factors influenced my decision. A key element was my friendship with Gene Myers, which began when we were both Computer Science faculty members at the University of Arizona. Gene had begun working on computer applications in molecular biology around 1980. Also, interest in human biology is almost mandatory when one's body gets to be the age of mine. But basically, I was drawn by the excitement surrounding what promised to be one of the most important achievements of mankind at the turn of the millennium. I had gotten into computer science in the 1960's, when no one knew what computer science was. By 1987, computer scientists were coming to work at 9:00 a.m. carrying briefcases. I missed the frontier spirit. In 1987, no one had even thought of a name for what I was doing; ``computational molecular biology'' and ``bioinformatics'' came later.

What made it relatively painless to embark on an entirely new career was that I was already a tenured full professor and that Penn State has been very supportive. Also, in 1987 there was very little literature in computer methods for sequence analysis. I needed only to study the papers of Michael Waterman. Nowadays, one has a mountain of papers to work through, such as the yearly proceedings of RECOMB, ISMB and the Pacific Symposium, plus the Journal of Computational Biology, and Bioinformatics. After a few years I stopped paying close attention to such papers, and I now rarely read them except in my capacity as a referee. Instead, I read as many biology papers as I can fit into my schedule. Because my goal is for biologists to read my papers, I need to read theirs.

I began working in bioinformatics in 1987 by reading papers on algorithms for comparing biosequences and working with Gene Myers to improve those results. After a couple of years I obtained NIH support for two graduate students. With each grant renewal, the picture improved, and currently I work with three colleagues, Zheng Zhang, Scott Schwartz and Cathy Riemer, who are entirely supported by my NIH grant. Compared with relying on a new crop of graduate students every four years, this makes a world of difference in the speed with which projects can be completed.

After a couple of years trying my hand at various problems, I staked a claim on the particular area where I intended to make a contribution to the Genome Project. The official plan for the Genome Project as of 1990 called for serious sequencing to begin half-way through the 15-year effort, say around 1998, with completion of the human sequence by 2005. Sequencing of several other organisms was also planned, though all of the animals officially earmarked for sequencing were primitive -- nothing more complex than a fly. However, back then I believed that to make sense of the human sequence it would be necessary to compare it with the sequence of another complex animal, such as a mouse. (I retain this belief.) The basic idea is that most of the genomic sequence appears to perform no function, and so it is free to drift via a variety of mutational mechanisms. Any piece of the sequence that is necessary for survival of the animal will change more slowly, since most mutations in it will be detrimental and hence tend not to be propagated. Thus, a functional region can be detected by virtue of being relatively similar between the two species, even if we have no idea of what function it performs. My guess was that after determining the human sequence, the tremendous infrastructure would keep running and finish the mouse sequence by, say, 2008.
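As a rough illustration of the conservation argument above, here is a minimal Python sketch that scans an existing pairwise alignment in fixed-size windows and reports the windows whose percent identity is high. It is only a toy under stated assumptions, not the method used by any software mentioned in this article: the function names, the window size, and the 80% threshold are all hypothetical, and the two sequences are invented for the example.

```python
def windowed_identity(aligned_a, aligned_b, window=10, step=5):
    """Return (start, percent_identity) for each window of an alignment.

    Both inputs are rows of the same pairwise alignment, so they must have
    equal length; '-' marks a gap and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        columns = zip(aligned_a[start:start + window],
                      aligned_b[start:start + window])
        matches = sum(1 for x, y in columns if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


def conserved_windows(aligned_a, aligned_b, window=10, step=5, threshold=80.0):
    """Keep only windows at or above the (assumed) identity threshold."""
    return [(start, pid)
            for start, pid in windowed_identity(aligned_a, aligned_b, window, step)
            if pid >= threshold]


# Invented toy data: the middle ten columns have diverged completely,
# while the flanking columns are identical in both sequences.
human = "ACGTACGTAC" + "GGATTACAGG" + "CTTAGCATCG"
mouse = "ACGTACGTAC" + "TCGAATGCTA" + "CTTAGCATCG"

candidates = conserved_windows(human, mouse, window=10, step=10)
```

In this toy case the first and last windows are flagged as candidate functional regions and the middle window is not. Real genomic comparisons must first compute the alignment itself, over sequences millions of letters long, which is where the harder algorithmic work lies.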

My chosen goal was to figure out how best to compare the human genome sequence with, say, the mouse genome sequence, so as to locate the critical regions that cannot easily be found by other means. Among my many reasons for picking this particular problem, perhaps the most influential was that in 1990 I established a collaboration with a Penn State biologist, Ross Hardison, whose work revolves around using genomic sequence conservation to detect putative functional regions of DNA, which are then tested in the laboratory to identify and verify the particular function. Since then, my career has been inextricably tied to Ross's. We have jointly managed several research grants and co-authored three dozen papers. Other motivations were that the potential scientific payoff for success in this project seemed worth the effort, and that it appeared to me that a certain amount of interesting computer science would be necessary to crack the problem.

To make a long story short, I have pursued that goal for ten years. My efforts ranged from biology (e.g., [2]), to sequence comparison algorithms (e.g., [6]), to user interfaces. Whereas even by 1987 people had accumulated a considerable level of experience comparing protein sequences, the new challenge was to effectively compare two or more genomic DNA sequences. While protein sequences are always short, rarely more than 5000 letters long, a genomic sequence can contain millions of letters. New sequence comparison algorithms were needed, and graphical interfaces were required to browse the results. A 1997 paper [3] reviews some of the progress, and the website https://globin.bx.psu.edu/pipmaker gives a more current view. I believe that my group has developed the basic capability for comparing the human and mouse genome sequences, with several important enhancements to be completed by the time the data arrives in large quantities.

My attack on the problem was diversified. An essential component was to become an expert at solving the sorts of problems that my software was intended to facilitate. In other words, to know what would assist biologists I needed to become one of them. Ross Overbeek, who very successfully made the transition from computer scientist to biologist, once told me that it takes three years. However, it took me more like ten. Also, it proved to be surprisingly successful for me to help other bioinformatics specialists solve their chosen problems, particularly with David Lipman and Bill Pearson, who develop programs to search protein sequence databases. My efforts (e.g., [1,5]) were very satisfying because of their popularity with biologists, and frequently the resulting ideas turned out to be useful for my chosen problem domain. Much less useful in my hands was the ``solution in search of a problem'' approach, i.e., starting with some nifty ideas from computer science and determining how they might be used in bioinformatics. This is a great way to generate papers, but for me it was only marginally successful for helping biologists.

The year 2000 will be a wonderful time to begin a career at the interface of biology and computer science. The questions that have occupied me for the last decade may, however, not be the best place to start, though you are welcome to try. I believe that the soon-to-be completed projects to sequence the human and mouse genomes are so obviously successful that they will whet the appetites of scientists and legislators for additional Big Biology projects. These people won't be left with the memories of cost over-runs and the feeling of ``So what?'' that have plagued other Big Science initiatives. If you would like to get an idea of where the field is headed, I recommend a paper [4] by Eric Lander.

If you decide to try your hand at bioinformatics, then perhaps we will meet. When I charted my goal in 1990, I was aware that my prediction for completion of the mouse sequence, 2008, coincided with me reaching retirement age. Thus, if the prediction had been accurate, I could finish my project and then turn my attention to some hobby. As it turns out, I'll soon need to identify another goal, because completion of the human and mouse sequences is way ahead of that schedule. Thus, it may be that in a few years I'll be reading your words in the SIGBIO Newsletter, seeking guidance for my next adventure in bioinformatics.

References

[1] Altschul, S., T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. Lipman (1997) Gapped BLAST and PSI-BLAST -- a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.

[2] Elnitski, L., W. Miller and R. Hardison (1997) Conserved E boxes function as part of the enhancer in hypersensitive site 2 of the beta-globin locus control region: Role of basic helix-loop-helix proteins. J. Biol. Chem. 272, 369-378.

[3] Hardison, R., J. Oeltjen and W. Miller (1997) Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Research 7, 959-966.

[4] Lander, E. (1996) The new genomics: global views of biology. Science 274, 536-539.

[5] Pearson, W. R., T. Wood, Z. Zhang and W. Miller (1997) Comparison of DNA sequences with protein sequences. Genomics 46, 24-36.

[6] Zhang, Z., P. Berman and W. Miller (1998) Alignments without low-scoring regions. J. Comput. Biol. , 197-210.