Searching genomic sequence

Mammalian genomic sequence can be queried in two ways: directly, by searching the DNA sequence using tools such as BLAST, or indirectly, by searching open reading frames predicted from the genomic sequence. The latter is dealt with in the section on genome annotation projects below. Genomic sequence can be approached initially in a similar way to ESTs. However, genomic sequence contains introns between coding regions, so a multi-exonic gene typically yields numerous short HSPs rather than one long match. Once the initial HSP is found, characterization of the potential novel gene can be greatly advanced by examining the genomic DNA and building a gene model. The aim of building a gene model is to identify the coding exons and assemble a virtual cDNA. Unlike EST sequence sources, genomic sequence allows the complete coding region to be assembled if the whole gene is present, which can save significant time in the laboratory. One difficulty, however, is that many genomic BAC sequences consist of unordered sequence fragments, or contigs, so a gene of interest may be split over several contigs. To overcome this problem, each contig should be treated as a separate sequence; the contigs can be turned into a BLAST database and searched with the target sequence. Contigs that contain consecutive HSPs can then be reassembled in order and the prediction programs run on the result. As genomic sequencing comes to a close, genome assembly projects such as Golden Path will remove the need for contig reassembly (Lander et al. 2001). Figure 9.2 describes a gene model building project.
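The contig reordering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the HSP records are taken to have already been parsed from a BLAST report, and the contig names, the HSP tuple and the order_contigs function are hypothetical, not part of any BLAST distribution.

```python
from collections import defaultdict
from typing import NamedTuple


class HSP(NamedTuple):
    contig: str       # contig identifier (hypothetical names)
    query_start: int  # 1-based start of the match on the query protein
    query_end: int    # 1-based end of the match on the query protein
    strand: str       # '+' or '-' strand of the contig carrying the match


def order_contigs(hsps):
    """Order contigs by the earliest query position they match.

    Returns a list of (contig, strand) pairs; the strand is the one that
    carries the most matched query residues for that contig.
    """
    first_hit = {}
    strand_weight = defaultdict(lambda: defaultdict(int))
    for h in hsps:
        first_hit[h.contig] = min(first_hit.get(h.contig, h.query_start),
                                  h.query_start)
        strand_weight[h.contig][h.strand] += h.query_end - h.query_start + 1
    ordered = sorted(first_hit, key=first_hit.get)
    return [(c, max(strand_weight[c], key=strand_weight[c].get))
            for c in ordered]


# Invented example: three unordered contigs from one BAC.
hsps = [
    HSP("contig_7", 210, 330, "+"),
    HSP("contig_2", 1, 85, "-"),
    HSP("contig_7", 120, 205, "+"),
    HSP("contig_4", 90, 118, "+"),
]
print(order_contigs(hsps))
```

Contigs reported on the minus strand would need to be reverse complemented before being joined, and the joined sequence should still be treated as provisional, since gaps of unknown size remain between contigs.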

Although web tools are available to help with gene model building, access to a UNIX system and sequence manipulation tools such as the EMBOSS suite of programs (Rice et al. 2000) allows greater working flexibility. Furthermore, many of the advanced gene prediction tools are faster to use at the UNIX command line and can handle larger sequences.

First we should discuss some of the tools available to aid gene prediction. Gene prediction programs fall into two groups, ab initio and homology driven, and both have their place in building the gene model. The greatest success is achieved when the results from both methodologies are combined and compared. Prior to running the prediction programs, the genomic sequence to be studied should be masked for genomic repeats using programs such as REPEATMASKER (http://itp.genome.washington.edu/cgi-bin/RepeatMasker).

Ab initio prediction programs look for gene signals in the raw genomic DNA and build an exon model. Table 9.1 provides a list of gene prediction software. The sequence signals used by prediction programs include the GC content of coding versus non-coding DNA, open reading frames, splice sites and, in some cases, promoters and polyadenylation signals. One of the most commonly used of these tools is GENSCAN (Burge and Karlin 1997), which uses a hidden Markov model of a gene to predict the virtual coding sequence. It should be noted that these programs can predict false exons, miss real exons, splice separate genes together and split single genes apart. Care therefore needs to be taken when using their output. Additionally, many members of the rhodopsin family of GPCRs have been found to be single coding exon genes (Gentles and Karlin 1999), and these can be missed by exon prediction programs.
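Because single coding exon genes can be missed by exon prediction programs, a plain open reading frame scan is a useful complement. The following is a minimal sketch, not the method used by any of the programs named above: it scans all six frames for ATG to stop ORFs, and the function names and the min_codons threshold are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan six frames for ORFs running from the first ATG to a stop codon.

    Returns (strand, start, end, n_codons) tuples. Coordinates are 0-based,
    half-open and relative to the scanned strand; end includes the stop
    codon, n_codons counts the ATG and sense codons but not the stop.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        results.append((strand, start, i + 3, n_codons))
                    start = None
    return results


# Toy sequence with one short ORF; real searches would use a much larger
# threshold, for example a few hundred codons for a GPCR.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
```

The sketch ignores ambiguous bases and repeat-masked regions (runs of N simply never match a start or stop codon), so long ORFs it reports should still be checked against the family of interest.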

Fig. 9.2 A gene model building flowchart.

Table 9.1 Sample of gene prediction programs and their relevant web sites

Program          Web site                                       Reference

Ab initio
GENSCAN          http://genes.mit.edu/GENSCAN.html              Burge and Karlin 1997
FGENESH          http://genomic.sanger.ac.uk/gf/gf.shtml        Solovyev and Salamov 1997
MZEF             http://argon.cshl.org/genefinder/              Zhang 1997
GRAIL II         http://compbio.ornl.gov/public/tools/          Xu et al. 1994
HMMgene          http://www.cbs.dtu.dk/services/HMMgene/        Krogh 1997

Homology driven
GENEWISE         http://www.sanger.ac.uk/Software/Wise2/        Birney and Durbin 2000
FGENESH+         http://genomic.sanger.ac.uk/gf/gf.shtml        See web site
PROCRUSTES       http://www-hto.usc.edu/software/procrustes/    Gelfand et al. 1996

Homology driven tools compare a protein sequence, or a profile of a gene family, against the genomic DNA. They incorporate a splicing model, so that the alignment can be split across exons, provided that the phase of each exon boundary is maintained. GENEWISE (Birney and Durbin 2000) is a good example of this type of application; although slower to run than the ab initio programs, it can achieve excellent results. It is worth comparing the local alignment option with the global alignment, as this can help identify the N-terminal exons. A list of gene prediction programs is provided in Table 9.1.
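The idea of maintaining phase across exon boundaries can be made concrete with a small calculation. The sketch below is an illustration only and does not reproduce the GENEWISE algorithm; it assumes the exon lengths given are the coding portions of each exon, in transcript order, and the function names are hypothetical.

```python
def intron_phases(coding_exon_lengths):
    """Phase of each intron in a multi-exon coding sequence.

    The phase is the number of bases of a split codon that lie before the
    intron: 0 (intron falls between codons), 1 or 2.
    """
    phases = []
    total = 0
    for length in coding_exon_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases


def frame_is_maintained(coding_exon_lengths):
    """True if the joined coding exons form a whole number of codons."""
    return sum(coding_exon_lengths) % 3 == 0


# Invented example: three coding exons of 100, 53 and 87 bases.
lengths = [100, 53, 87]
print(intron_phases(lengths))        # phase of intron 1 and intron 2
print(frame_is_maintained(lengths))  # whether the virtual cDNA stays in frame
```

When a candidate exon is added to or removed from a model, recomputing the phases in this way shows immediately whether the downstream exons would still be read in the same frame.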

In addition to using prediction tools, the region of masked genomic sequence should also be compared to EST databases, as EST matches can support the exons predicted by the gene finding tools. The use of ESTs to support predictions is exemplified by the identification of the Family 3 receptor GPRC5B (Robbins et al. 2000). It is important to remember that ESTs can also derive from untranslated regions: matches that do not lie in a coding exon but lie, say, close to a predicted stop codon add further evidence that a gene is present. Furthermore, comparison to syntenic genomic sequence from other vertebrates can highlight phylogenetic footprints that provide supporting evidence for coding exons. A useful tool for such analysis is PIPMAKER (Schwartz et al. 2000) (http://bio.cse.psu.edu/pipmaker/).
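Checking which predicted exons are supported by EST matches amounts to an interval overlap test on genomic coordinates. The following sketch assumes the EST hits have already been reduced to start and end positions on the genomic sequence; the coordinates, the est_support function and the min_overlap threshold are invented for illustration.

```python
def est_support(exons, est_hits, min_overlap=30):
    """Count EST matches overlapping each predicted exon.

    exons: dict mapping exon name to (start, end) on the genomic sequence.
    est_hits: list of (start, end) genomic intervals covered by EST matches.
    Coordinates are 1-based and inclusive. An EST hit counts as support
    when it shares at least min_overlap bases with the exon.
    """
    support = {}
    for name, (e_start, e_end) in exons.items():
        count = 0
        for h_start, h_end in est_hits:
            overlap = min(e_end, h_end) - max(e_start, h_start) + 1
            if overlap >= min_overlap:
                count += 1
        support[name] = count
    return support


# Invented coordinates for three predicted exons and four EST matches.
exons = {"exon1": (1200, 1350), "exon2": (2900, 3010), "exon3": (5400, 5620)}
est_hits = [(1180, 1350), (2900, 2990), (1250, 1340), (7000, 7300)]
print(est_support(exons, est_hits))
```

In this example the last EST hit overlaps no predicted exon; as noted above, such a match may still be informative if it lies in an untranslated region near a predicted stop codon.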

Once the prediction tools have been run and comparisons to ESTs and other genomes made, the exons need to be assembled and compared back to the family of interest. It is at this point that the model is handcrafted so that the prediction makes the most biological sense; for example, a predicted exon that inserted a large sequence in the middle of a transmembrane helix would be unlikely to be real. This requires assembling the high confidence exons while adhering to splice junction rules. Visualization tools such as GENOTATOR can help the researcher compare the output from the prediction tools and EST searches (Harris 1997) (http://www.fruitfly.org/~nomi/genotator/). The process should be iterated until the optimal prediction is achieved.
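The assembly step can be sketched as follows: join the chosen exons into a virtual cDNA, check each implied intron against a splice junction rule, and look for in-frame stop codons. This is a simplified illustration that assumes a forward-strand gene, coding exons only, and the canonical GT...AG intron boundaries (rarer junction types exist and would be flagged here); all function names and the toy sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def check_splice_sites(genomic, exons):
    """Return introns whose boundaries are not the canonical GT...AG.

    exons: list of (start, end) pairs sorted along the forward strand,
    0-based and half-open. Each problem is reported as
    (intron_start, intron_end, donor, acceptor).
    """
    problems = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        donor = genomic[end1:end1 + 2].upper()
        acceptor = genomic[start2 - 2:start2].upper()
        if (donor, acceptor) != ("GT", "AG"):
            problems.append((end1, start2, donor, acceptor))
    return problems


def virtual_cdna(genomic, exons):
    """Concatenate the exon sequences into a virtual cDNA."""
    return "".join(genomic[s:e] for s, e in exons).upper()


def has_internal_stop(cds):
    """True if any codon other than the last one is a stop codon."""
    return any(cds[i:i + 3] in STOPS for i in range(0, len(cds) - 3, 3))


# Toy genomic sequence: two coding exons separated by one GT...AG intron.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GAATAA"
exons = [(0, 6), (18, 24)]

cds = virtual_cdna(genomic, exons)
print(cds)
print(check_splice_sites(genomic, exons))
print(has_internal_stop(cds))
```

A model that passes these checks is only internally consistent; it still needs to be compared back to the family of interest, as described above, before it is accepted.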
