Computational Biology

During the 1960s, protein chemists began to collect sequences resulting from their own research, from that of their colleagues, and from the literature. Dayhoff and Ledley, who assembled the first major collection of genetic sequence information in the form of atlases of protein sequence and structure, were at the forefront of these efforts. During the 1970s, Berman and colleagues, who assembled a database of X-ray crystallographic protein atomic coordinates, and Kabat and colleagues, who constructed a database on the structure and diversity of immunoglobulins, extended these efforts. Also during the 1970s, two online computer analysis systems, the time-sharing molecular analysis (PROPHET) system and the Stanford University Medical Experimental Computer Resource (SUMEX), were developed by government-sponsored groups and in academia.

These databases and computer analysis systems were part of a larger attempt to demonstrate the potential of database facilities for the storage and retrieval of molecular data and the importance of computer support for sequence analysis. In 1979, scientists assembled at a workshop sponsored by the National Science Foundation reached a consensus on the need to establish an international computer database for nucleic acid sequences, and recommendations for its establishment were formally outlined. They knew that computer programs capable of processing large amounts of data would be required to facilitate the storage and editing of molecular sequences, to produce copies of a sequence in various forms, to translate a DNA sequence into the amino acid sequence it encoded, to search a sequence for particular shorter sequences, to analyze codon usage and base composition, to compare two sequences for homology, to locate regions of sequences that were complementary, and to translate and compare two sequences for amino acid similarities.120 Unfortunately, immediate progress toward this goal stalled because a report of this workshop was not widely distributed, preventing a broad-based discussion of its findings. Subsequent meetings convened in rapid succession during 1980, notably in Schönau, Germany, on April 24, and at the National Institutes of Health (NIH) on July 14 and August 11, created a sense of urgency about this issue. The submission of proposals on alternative approaches to establishing such a resource, and further prodding by a number of researchers, helped refocus attention on the need to establish a center dedicated to sequence collection and analysis.6,7 At a final meeting convened at the NIH on December 7, 1980, guidelines were set forth to implement such a project in two phases: Phase I was to establish a centralized nucleic acid sequence database, and Phase II was to establish an analysis and software library coupled to the database. In the project as implemented, however, only Phase I was supported while Phase II was postponed. Thus, on June 30, 1982, the National Institute of General Medical Sciences announced the award for the establishment of GenBank®, the nucleic acid sequence data bank.
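
To give a concrete sense of the program capabilities listed above, the short Python sketch below illustrates a few of them: translating a DNA sequence into the amino acid sequence it encodes, tallying base composition, locating a shorter sequence within a longer one, and generating the complementary strand. The function names, abbreviated codon table, and toy sequence are invented for illustration and are not drawn from any of the systems described here.

    # Illustrative sketch of the kinds of sequence-handling routines the
    # 1979-1980 workshops called for; function names and data are invented.
    from collections import Counter

    # Abbreviated codon table (standard genetic code); a real implementation
    # would include all 64 codons.
    CODON_TABLE = {
        "ATG": "M", "TTT": "F", "TTC": "F", "AAA": "K", "AAG": "K",
        "GGT": "G", "GGC": "G", "TAA": "*", "TAG": "*", "TGA": "*",
    }

    def translate(dna):
        """Translate the first reading frame of a DNA sequence into amino acids."""
        return "".join(CODON_TABLE.get(dna[i:i + 3], "X")  # X marks unknown codons
                       for i in range(0, len(dna) - 2, 3))

    def base_composition(dna):
        """Count each base, one of the analyses listed above."""
        return Counter(dna)

    def find_subsequence(dna, query):
        """Locate every position at which a shorter sequence occurs."""
        return [i for i in range(len(dna) - len(query) + 1)
                if dna[i:i + len(query)] == query]

    def reverse_complement(dna):
        """Return the complementary strand, used to find complementary regions."""
        pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
        return "".join(pairs[b] for b in reversed(dna))

    if __name__ == "__main__":
        seq = "ATGTTTAAAGGTTAA"              # toy example sequence
        print(translate(seq))                 # -> MFKG*
        print(base_composition(seq))          # base counts
        print(find_subsequence(seq, "AAA"))   # -> [6]
        print(reverse_complement(seq))        # -> TTAACCTTTAAACAT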

While these attempts to establish GenBank moved ahead in fits and starts, investigators continued to apply rapid DNA sequencing techniques to predict the complete sequence of specific proteins.124,125 In one notable instance, concluded in 1978, J.G. Sutcliffe utilized Sanger's "plus and minus" method56 to determine the nucleotide sequence of the ampicillin resistance (penicillinase) gene of Escherichia coli, which encoded a β-lactamase of approximately 27,000 daltons. Sutcliffe conducted his research in Walter Gilbert's laboratory at Harvard University, and Gilbert knew that investigators in Jeremy Knowles' laboratory at the University of Edinburgh were engaged in direct studies to determine the peptide composition and amino acid sequence of that enzyme.126 Within 7 months of initiating the project, Sutcliffe had completed the nucleotide sequence125 and his findings about the hypothetical protein derived from the translation of the DNA sequence were in complete agreement with the amino acid sequence observed in Knowles' laboratory.127 Sutcliffe's study was the first to demonstrate that deriving the primary sequence of a protein from its nucleotide sequence was much faster and easier than sequencing the protein itself. His study made a powerful statement about the value of the sequencing methodology.126

As the number of protein molecules and nucleic acid fragments for which sequences had been determined expanded, the need for efficient, rapid, and economical methods for conducting similarity searches became even more obvious. Early in the 1980s, several different methods were in use for analyzing such similarities, and at that time all of the software search tools used some measure of similarity between sequences to distinguish biologically significant relationships from chance similarities. The methods then implemented could be divided into two categories: those for global comparisons, in which two complete sequences were considered, and those for local searches, in which the search was limited to similar fragments of two sequences. These older methods, however, were computationally intensive and expensive when applied to large data banks.128
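
A brief sketch may help convey why global comparisons were so costly when applied to a whole data bank. In a Needleman-Wunsch-style global comparison, a dynamic-programming matrix with one cell for every pair of positions must be filled, so the work grows with the product of the two sequence lengths and is repeated for every database entry. The following minimal Python sketch, with arbitrary illustrative scoring values rather than those of any published tool, computes such a global score:

    # Minimal sketch of a Needleman-Wunsch-style global comparison; the scoring
    # values here are arbitrary illustrative choices, not those of any
    # published search tool.
    def global_score(a, b, match=1, mismatch=-1, gap=-1):
        """Return the optimal global alignment score of sequences a and b."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):                 # cost of gapping all of a
            score[i][0] = i * gap
        for j in range(1, cols):                 # cost of gapping all of b
            score[0][j] = j * gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag,
                                  score[i - 1][j] + gap,   # gap in b
                                  score[i][j - 1] + gap)   # gap in a
        return score[-1][-1]

    print(global_score("GATTACA", "GCATGCA"))    # toy sequences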

In 1983, Wilbur and Lipman partially resolved this problem by developing a new algorithm that yielded rigorous sequence alignments for global comparisons. The method substantially reduced search time with minimal loss of sensitivity.128 Using this algorithm, the entire Protein Data Bank of the National Biomedical Research Foundation (NBRF) could be searched in less than 3 minutes and all eukaryotic sequences in the Los Alamos Nucleic Acid Database in less than 2 minutes. Waterfield and associates employed this technique to demonstrate that the sequence of platelet-derived growth factor was related to the transforming protein p28sis of simian sarcoma virus.129 Later, other rapid algorithms such as FASTA and related versions of this program were developed, permitting large databases to be searched on commonly available minicomputers.130
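
Much of the speed of these word-based programs came from indexing short k-tuples (words) of the database sequences in a lookup table, so that only entries sharing words with the query receive detailed comparison. The sketch below illustrates that general idea with an invented miniature data bank and an arbitrary word length; it is not a reimplementation of the Wilbur-Lipman algorithm or of FASTA.

    # Schematic illustration of k-tuple (word) indexing, the general idea behind
    # word-based rapid searches; the word length, data, and ranking are invented
    # for this example and do not reproduce any published program.
    from collections import defaultdict

    def build_word_index(sequences, k=4):
        """Map every k-tuple to the database entries (and offsets) containing it."""
        index = defaultdict(list)
        for name, seq in sequences.items():
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].append((name, i))
        return index

    def rank_by_shared_words(query, index, k=4):
        """Rank database entries by the number of k-tuples shared with the query."""
        counts = defaultdict(int)
        for i in range(len(query) - k + 1):
            for name, _offset in index.get(query[i:i + k], ()):
                counts[name] += 1
        return sorted(counts.items(), key=lambda item: item[1], reverse=True)

    # Hypothetical miniature data bank
    db = {
        "entry1": "ATGGCGTACGTTAGCGT",
        "entry2": "TTTTTTTTTTTTTTTTT",
        "entry3": "ATGGCGTACGAAAGCGT",
    }
    index = build_word_index(db)
    print(rank_by_shared_words("ATGGCGTACG", index))   # entries sharing the most words first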

In 1988, the National Center for Biotechnology Information (NCBI) at the NIH was created to house and develop information systems for use in molecular biology. The NCBI was assigned responsibility for maintaining GenBank® and began to provide data analysis, data retrieval, and resources that operated on the data in GenBank. Since its creation, the purview of the NCBI has greatly expanded, and the suite of resources and services currently offered can be grouped into seven categories: (1) database retrieval systems, (2) sequence similarity search programs, and resources for (3) the analysis of gene-level sequences, (4) chromosomal sequences, (5) genome-scale analysis, (6) the analysis of gene expression and phenotypes, and (7) protein structure and modeling, all of which can be accessed through the NCBI homepage, http://www.ncbi.nlm.nih.gov.15

In 1990, Altschul and colleagues introduced the basic local alignment search tool (BLAST) as a new approach to rapid sequence comparisons.131 The basic algorithm was applicable to a variety of contexts including DNA and protein sequence database searches, motif searches, gene identification searches, and the analysis of multiple regions of similarity in long DNA sequences. BLAST had the advantage of being an order of magnitude faster than existing search tools of comparable sensitivity. Since the original version was published, several new versions of BLAST have been developed that improved sensitivity and performance or were customized for high-performance computing.132,133

The most frequent type of analysis performed on GenBank data uses BLAST to search for nucleotide or protein sequences similar to a query sequence. To facilitate searches undertaken for other purposes, or searches that require other approaches, the NCBI offers specialized versions and customized implementations of the BLAST family of programs to support many applications.15
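
For readers who want to run such searches programmatically today, one commonly used route is Biopython's interface to the NCBI web BLAST service, sketched below. The query sequence is a made-up placeholder, the call is a live network request subject to NCBI usage limits, and newer Biopython releases may recommend a different module for the same task.

    # Hedged sketch: submitting a nucleotide query to the NCBI web BLAST service
    # through Biopython. The query sequence is a made-up placeholder, the call is
    # a network request subject to NCBI usage limits, and newer Biopython
    # releases may prefer a different module for the same task.
    from Bio.Blast import NCBIWWW, NCBIXML

    query_seq = "ATGGCGTACGTTAGCGTACCGGTTAACCGGATCGATCGATCGTACGATCG"

    # Run blastn against the public nucleotide database (may take minutes).
    result_handle = NCBIWWW.qblast("blastn", "nt", query_seq)

    # Parse the XML output and report the highest-scoring alignments.
    record = NCBIXML.read(result_handle)
    for alignment in record.alignments[:5]:
        best_hsp = alignment.hsps[0]
        print(alignment.title[:60], "E-value:", best_hsp.expect)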

For more than 15 years, computational methods for gene finding were based on searches for sequence similarities. These well-established methods proved successful in many cases, but a follow-up study showed that only a fraction of newly discovered sequences had identifiable homologs in current databases.134 For example, based on sequences in GenBank in 2000, when only about 25% of the human genome was available, the results of the study suggested that only about half of all new vertebrate genes might be discovered by sequence similarity searches. Recently, another computational approach, the template approach, has been developed. The template approach, more commonly referred to as ab initio gene finding, combines coding statistics with signal sensor detection in a single framework. Coding statistics are measures of protein-coding function, whereas signal sensors are short nucleotide subsequences recognized by the cell machinery as initiators of certain processes.135 While many different nucleotide patterns have been examined as signal sensors, those usually modeled include promoter elements, start and stop codons, splice sites, and poly(A) sites. Rogic and colleagues conducted a comparative analysis of several recently developed programs (FGENES, GeneMark.hmm, Genie, Genscan, HMMgene, Morgan, and MZEF) that incorporate coding statistics and signal sensors. Their analysis examined the accuracy of gene structure prediction as a function of various structural features, such as the G + C content of the sequence, the length and type of exons, the signal type, and the score of exon prediction. It showed that this new generation of programs provides, overall, a substantial improvement over previous programs in predicting some of the complexities of gene structure and is an important step in deciphering the content of any genome.
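
As a greatly simplified illustration of the signal-sensor idea, the toy sketch below scans a sequence for open reading frames bounded by a start codon and an in-frame stop codon, the most elementary of the signals listed above. The example sequence and parameters are invented, and the programs compared by Rogic and colleagues combine such signals with coding statistics and probabilistic models far beyond this sketch.

    # Toy "signal sensor": scan for open reading frames bounded by a start codon
    # (ATG) and an in-frame stop codon. The example sequence is invented, and the
    # gene-finding programs named above rely on coding statistics and
    # probabilistic models far beyond this sketch.
    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def find_orfs(dna, min_codons=3):
        """Return (start, end) indices of simple ORFs in the three forward frames."""
        orfs = []
        for frame in range(3):
            i = frame
            while i + 3 <= len(dna):
                if dna[i:i + 3] == "ATG":                  # start-codon signal
                    j = i + 3
                    while j + 3 <= len(dna):
                        if dna[j:j + 3] in STOP_CODONS:    # in-frame stop signal
                            if (j - i) // 3 >= min_codons:
                                orfs.append((i, j + 3))
                            break
                        j += 3
                i += 3
        return orfs

    print(find_orfs("CCATGAAATTTGGGTAACCATGCCC"))   # -> [(2, 17)] for this invented sequence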
