The PHRED/PHRAP/CONSED software tools developed by Green and colleagues for DNA sequence analysis have gained recognition and worldwide use in the genomic community.24,25 A key feature of these programs is their emphasis on objective criteria for measuring the accuracy of sequences and assemblies. PHRED reads DNA sequencer trace data obtained by the dideoxy chain-termination method of Sanger,3 calls bases, and assigns a quality value (a log-transformed error probability) to each base call. After calling the bases, PHRED writes the sequences in either FASTA or Standard Chromatogram Format (SCF). PHRAP assembles shotgun DNA sequence data. It uses a combination of user-supplied and internally computed data to improve the accuracy of assembly in the presence of repeats, and it constructs the contig sequence from the highest-quality parts of the reads. CONSED is a graphical tool for editing PHRAP assemblies into a finished sequence, and it implements the finishing strategy: it allows the user to edit the sequence, uses error probabilities from PHRED and PHRAP to guide the editing, can select primers and templates for locations specified by the user, and automates the choice of reads for finishing the sequence.136
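
The log-transformed quality values described above can be illustrated with a short sketch. It assumes the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The function names are hypothetical and are not part of the PHRED program itself.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability to a Phred-style quality value.

    Assumes the standard definition Q = -10 * log10(p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-style quality value back to an error probability."""
    return 10.0 ** (-q / 10.0)


def expected_errors(quals) -> float:
    """Sum of per-base error probabilities: the expected number of
    miscalled bases in a read with the given quality values."""
    return sum(phred_to_error_prob(q) for q in quals)
```

Under this definition, a quality value of 20 corresponds to an error probability of 1 in 100, and a value of 30 to 1 in 1000. An objective per-base scale of this kind is what lets downstream tools weigh reads against each other.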

Comparative analyses of DNA and protein sequences have thus become an indispensable part of biological research. In 1999, the NIH launched the Mammalian Gene Collection (MGC) program, which combines and illustrates many of the key features of computational biology. As originally envisioned, the program was sponsored by 16 NIH institutes and the National Library of Medicine and led by the National Cancer Institute and the National Human Genome Research Institute. It includes components for the production, analysis, and distribution of libraries, clones, and sequences, as well as for technology development. Its major goal was to obtain and identify a full set of human and other mammalian full-length sequences and clones of expressed genes.137 In the first report of this effort, the MGC program generated and performed an initial analysis of more than 15,000 full-length human and mouse cDNA sequences.137 In gene identification, predicting the coding sequence (open reading frame, ORF) of a typical gene is an important first step in deciphering the gene content of any genome. In the human portion of the study, a total of 12,419 full-ORF human cDNA clones, corresponding to 9530 distinct human genes, were sequenced to finished standards. Candidate full-ORF clones for an additional 7800 human genes were also identified. This evidence, combined with that from the carefully annotated sequence of chromosome 22, indicated that the MGC contained 52% of all human genes and would grow to 67% in the near future.138
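
To make the idea of an open reading frame concrete, the following is a minimal sketch of a naive ORF scan. It is an illustration only, not the method used by the MGC program. It assumes an uppercase or lowercase DNA string, searches only the three forward frames, and treats an ORF as an ATG start codon followed in frame by the first stop codon. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 1):
    """Return (start, end) index pairs of forward-strand ORFs.

    Each ORF runs from an ATG to the first in-frame stop codon;
    `end` is exclusive and includes the stop codon. `min_codons`
    is the minimum number of codons before the stop (ATG counts).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through complete codons in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 - 1 >= min_codons:
                    orfs.append((start, end))
                start = None
    return orfs


def longest_orf(seq: str):
    """Return the (start, end) pair of the longest ORF, or None."""
    return max(find_orfs(seq), key=lambda se: se[1] - se[0], default=None)
```

Real gene-prediction programs go well beyond this: they handle both strands, introns, and statistical models of coding sequence. The sketch only shows what "full ORF" means for a cDNA clone, namely an uninterrupted run of codons from start to stop.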
