The nucleotide sequence component of the public databases can be divided into two main categories: complementary DNA (cDNA)-derived sequence and genomic sequence. Historically, the cDNA sequences present in the databases were derived from focused gene cloning experiments and therefore represented well-characterized sequence for a defined gene. More recently, improved DNA sequencing methodologies have allowed large-scale genome projects to be undertaken, generating raw sequence data for which the function is unknown. This uncharacterized sequence provides a good starting point for novel gene identification and annotation. Figure 9.1 provides an overview of the sequences present in the public databases.
The first of the high throughput sequencing projects to make their mark were the public and private EST sequencing efforts (Boguski et al. 1993). EST sequences are reads of approximately 500 nucleotides from the 5' and 3' ends of cloned cDNAs. They are single-pass reads, so their quality can be poor. cDNA cloning artifacts are also represented in EST databases. These artifacts include mispriming events and aberrant transcripts, which result in ESTs that represent truncated cDNAs, ESTs whose 5' and 3' annotation is in reverse orientation with respect to the genuine transcript, and contamination with unspliced genomic sequence. Depending on the quality of the cDNA library that is sequenced, the ESTs may represent the true 5' and 3' ends of a cDNA insert; for larger mRNAs, however, only partial cDNA sequence may be provided. In this case the derived 5' EST sequences may fall in the coding region of the true transcript or may solely represent 3' untranslated region sequence.

Fig. 9.1 The relationship of unfinished genomic sequence through to protein sequence, with the GenBank subdivision for each sequence source. Unfinished/unordered (high throughput) genomic: gb_htg. Finished genomic: gb_pr. cDNA: gb_pr. High throughput cDNA: gb_htc. Expressed sequence tag (EST): gb_est. Protein: SWISSPROT, genpept, pir. Sequences are drawn in 5' to 3' orientation; boxes represent exons (transcribed regions) and filled boxes represent coding region.
Greater value can be obtained from EST sequences by assembling overlapping EST reads. The assembly can be further enriched by using the 'clone identifier' information, which allows discrete EST clusters (contigs) to be linked by 5'-3' information independently of sequence overlap. This type of analysis can allow researchers to walk through an unknown cDNA sequence rapidly. For an example of this method see the Human Gene Indices and TIGR Human Consensus (THC) resource (http://www.tigr.org/tdb/tgi.shtml). A related approach is taken by the NCBI's UNIGENE, a system which automatically partitions GenBank sequences into a non-redundant set of gene-oriented clusters. It captures both cDNA and EST sequence as well as comparisons to model organism sequences such as mouse, Caenorhabditis elegans, Drosophila melanogaster and Saccharomyces cerevisiae. Unlike the TIGR resource, UNIGENE collects the related EST sequences into sequence 'bins' but does not make the assembled EST contigs available. The largest representative sequence can, however, be obtained.
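The clone-identifier linkage described above can be illustrated with a minimal Python sketch. It assumes that each EST has already been assigned to a contig by sequence overlap and that each EST record carries the identifier of the cDNA clone it was read from. The function name, the dictionaries and the identifiers are hypothetical and are not taken from the TIGR or UNIGENE pipelines; the sketch only shows how two contigs with no sequence overlap can be joined because they contain the 5' and 3' reads of the same clone.

```python
from collections import defaultdict


def link_contigs_by_clone(est_to_contig, est_to_clone):
    """Group contigs that share at least one cDNA clone (hypothetical helper)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Record which contigs each clone contributes reads to.
    clone_to_contigs = defaultdict(set)
    for est, contig in est_to_contig.items():
        find(contig)  # register the contig even if it is never linked
        clone = est_to_clone.get(est)
        if clone is not None:
            clone_to_contigs[clone].add(contig)

    # Contigs holding reads from the same clone belong to one transcript group.
    for contigs in clone_to_contigs.values():
        first, *rest = sorted(contigs)
        for other in rest:
            union(first, other)

    groups = defaultdict(set)
    for contig in parent:
        groups[find(contig)].add(contig)
    return sorted(sorted(group) for group in groups.values())


# Invented example data: clone1 has a 5' read in contigA and a 3' read in contigB.
est_to_contig = {
    "est1_5p": "contigA",
    "est1_3p": "contigB",
    "est2_5p": "contigA",
    "est3_3p": "contigC",
}
est_to_clone = {
    "est1_5p": "clone1",
    "est1_3p": "clone1",
    "est2_5p": "clone2",
    "est3_3p": "clone3",
}
print(link_contigs_by_clone(est_to_contig, est_to_clone))
# prints [['contigA', 'contigB'], ['contigC']]
```

In practice the clone annotation itself can be wrong, as noted for EST orientation above, so links of this kind are best treated as hypotheses to be confirmed by further sequence.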
High throughput cDNA (HTcDNA or HTC) represents unfinished cDNA sequence and may include 5' UTR, 3' UTR and coding regions (http://www.ebi.ac.uk/embl/Documentation/Release_notes/relnotes66/relnotes.html#htc). Examples of these cDNAs are those produced by the RIKEN Genomic Sciences Center (Kawai et al. 2001). They can represent full-length or partial transcripts. Once the sequences are finished they are moved to the appropriate taxonomic division of the database.
The majority of human nucleotide sequence in the public database is now derived from the human genome sequencing project. This was initiated during the late 1980s and launched in 1990; for background see Lander et al. (2001). As with cDNA sequence, the quantity of genomic-derived sequence in the public database has expanded rapidly. Prior to the high throughput sequencing projects, genomic sequence was derived through specific gene-focused projects. Routinely these entries would be cosmid vector cloned inserts of around 30 kb, or might only cover the sequence around exons.
The advent of high throughput genome sequencing has added three further representations of genomic sequence to the database. The public human genome sequencing effort has focused on sequencing bacterial artificial chromosome (BAC) clones, which can hold around 200 kb of DNA. Each BAC clone is 'shotgun' cloned into smaller vectors, which are then sequenced, and contiguous reads representing sub-sequences of the BAC are built up. At this point the BAC sequence is given an accession number and submitted to the database as unfinished high throughput genomic sequence (HTGS). The order and orientation of the contigs are unknown until the sequence of the BAC is completed, which makes gene hunting in unfinished sequence problematic. Once the genomic sequence is completed, the entry is moved from the HTGS division of the database into its taxonomic division. Depending on the sequencing centre, biological sequence features such as predicted genes and repeat elements are annotated in the entry.
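The difficulty with unordered, unoriented contigs can be shown with a small Python sketch. It assumes exact string matching of known exon sequences against the contigs of one unfinished BAC entry, which is a deliberate simplification; a real search would use an alignment tool that tolerates sequencing errors. The function names and the toy sequences are invented for illustration.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_exons(contigs, exons):
    """Report (exon, contig, strand, offset) for exact matches on either strand."""
    hits = []
    for exon_name, exon in exons.items():
        for contig_name, contig in contigs.items():
            for strand, probe in (("+", exon), ("-", reverse_complement(exon))):
                pos = contig.find(probe)
                if pos != -1:
                    hits.append((exon_name, contig_name, strand, pos))
    return hits


# Toy unfinished entry: two contigs whose order and orientation are unknown.
contigs = {
    "contig1": "TTTTACGGATCCAAAA",
    "contig2": "GGGGTTGCAATGCCCC",
}
# Two exons of the same hypothetical gene.
exons = {"exon1": "ACGGATCC", "exon2": "CATTGCAA"}

for hit in locate_exons(contigs, exons):
    print(hit)
# prints ('exon1', 'contig1', '+', 4)
# prints ('exon2', 'contig2', '-', 4)
```

Here the two exons fall on different contigs and on opposite strands, so the gene structure cannot be read directly from the entry; every contig has to be searched in both orientations and the hits reconciled by hand or against a cDNA.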
Genome Survey Sequences (GSSs) represent a third division of the database that contains raw genomic sequence. GSSs are end reads of BAC clones, approximately 500 bp in length, and are the genomic equivalent of ESTs. Their primary function is to aid the assembly of BAC clones into overlapping assemblies. Exonic sequence can be found in these small genomic reads, however, and GSSs can therefore be used for novel gene hunting. This is exemplified by the identification of a second melanin-concentrating hormone receptor, MCH2, which was cloned following the identification of a putative coding exon in a GSS (Hill et al. 2001; Sailer et al. 2001).
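As a rough illustration of screening short genomic reads for possible exonic sequence, the Python sketch below looks for long stop-free stretches in all six reading frames of a read. This is not the method used in the studies cited above, and a stop-free stretch is only weak evidence of coding potential; the function names and the 30-codon default threshold are assumptions made for the example. Candidates found this way would normally be followed up with a similarity search against known proteins.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_open_stretch(seq, frame):
    """Return (start, end) of the longest stop-free run of codons in one frame."""
    best = (frame, frame)
    run_start = frame
    last = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            # Close the current run just before the stop codon.
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = i + 3
        last = i + 3
    # A run may extend to the end of the read without meeting a stop.
    if last - run_start > best[1] - best[0]:
        best = (run_start, last)
    return best


def candidate_exons(read, min_codons=30):
    """List (strand, frame, start, end) for stop-free runs of at least min_codons."""
    read = read.upper()
    hits = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            start, end = longest_open_stretch(seq, frame)
            if (end - start) // 3 >= min_codons:
                hits.append((strand, frame, start, end))
    return hits
```

Coordinates on the "-" strand refer to the reverse-complemented read. Because exons in a genomic read are bounded by splice sites rather than by start and stop codons, the sketch does not require an initiating ATG.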