Searching EST data

As discussed earlier, ESTs represent partial cDNA sequence and are therefore less problematic to search than genomic sequence, although in most cases only a partial sequence can be assembled. Another factor to consider is that some GPCRs are expressed at low levels in a tissue and will therefore be underrepresented in cDNA libraries, and hence in EST databases.

A gene mining session would commonly start with a BLAST search against an EST database. To begin a search, a query sequence needs to be selected. One point to consider is the domain structure of the query sequence. If other domains are present, results may be returned that are not pertinent to the intended query. For instance, the rhodopsin family glycoprotein hormone receptors contain a seven transmembrane domain (TMD) spanning region but also multiple leucine-rich repeat (LRR) domains (Jiang et al. 1995). A search with this type of receptor will identify other rhodopsin GPCRs but will also return a large number of LRR-containing non-GPCR sequences.
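One way to act on this point, which the text does not itself prescribe, is to trim the query to the transmembrane region before searching. The following is a minimal Python sketch under the assumption that the domain boundaries are already known; the function names, the 1-based inclusive coordinates and the toy sequence are hypothetical illustrations, not real receptor data.

```python
def trim_to_region(sequence: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of a protein sequence."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("region must lie within the sequence")
    return sequence[start - 1:end]


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy sequence and made-up boundaries, purely for illustration.
    full_length = "MKLLRRLLRRLLRRAVLIFAVLGNLLVIWAVTRHK"
    tm_only = trim_to_region(full_length, 15, 35)
    print(to_fasta("query_TMD_only", tm_only), end="")
```

The trimmed record can then be used as the protein query in place of the full-length sequence, at the cost of losing any genuine matches that lie outside the retained region.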

Once the query protein sequence has been selected, the nucleotide database should be searched with TBLASTN. We could search the database using a nucleotide sequence and BLASTN, but far better results will be obtained by using TBLASTN and a protein query, since a gene family will show greater conservation at the amino acid level than at the nucleotide level. Once the results are returned, the alignments (high-scoring segment pairs, or HSPs) can be viewed. Knowledge of the conserved residues that describe the GPCR family is of great advantage in looking through the matches. The E-value, the number of matches with an equal or better score that would be expected to occur in the database by chance, can also be used to judge the significance of the match. For globular proteins, an E-value of 10^-3 or lower is deemed significant. However, the structural constraints on related globular proteins are far more rigid than those on GPCRs: the seven transmembrane spanning regions can vary widely in their composition of hydrophobic residues and still retain the ability to form a hydrophobic membrane-spanning helix. Matches with E-values of 9.0 or higher can therefore still represent family members.
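The lenient E-value triage described above can be sketched in Python. This assumes the TBLASTN hits were saved in the 12-column tabular format written by NCBI BLAST+ (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names, the default cut-off of 10.0 and the sample hits are illustrative assumptions, not values taken from a real search.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float
    bitscore: float


def parse_tabular_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column BLAST tabular output, skipping blank and comment lines."""
    hits = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hits.append(Hit(row[0], row[1], float(row[2]),
                        float(row[10]), float(row[11])))
    return hits


def lenient_filter(hits: Iterable[Hit], max_evalue: float = 10.0) -> List[Hit]:
    """Keep hits at or below max_evalue, best (lowest) E-value first."""
    return sorted((h for h in hits if h.evalue <= max_evalue),
                  key=lambda h: h.evalue)


# Made-up hits for illustration only.
SAMPLE = (
    "query_gpcr\tEST_A\t45.2\t120\t60\t2\t10\t129\t3\t362\t2e-15\t78.2\n"
    "query_gpcr\tEST_B\t28.0\t75\t50\t1\t200\t274\t5\t229\t4.5\t30.1\n"
    "query_gpcr\tEST_C\t25.0\t40\t30\t0\t50\t89\t10\t129\t55\t25.0\n"
)

if __name__ == "__main__":
    for hit in lenient_filter(parse_tabular_hits(io.StringIO(SAMPLE))):
        print(hit.subject, hit.evalue)
```

A cut-off this permissive only shortlists candidates. Each retained alignment still needs to be inspected by eye for the conserved GPCR residues, since the E-value alone cannot separate weak family members from chance matches.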

When an EST match has been identified, the clone identifier should be used to check for 5' or 3' partner ESTs. As described above, consulting UNIGENE and TIGR HGI will add value to the match by finding overlapping and 5'-3' linked ESTs. The sequence quality of an EST can be poor, but the tool ESTWISE (http://www.sanger.ac.uk/Software/Wise2/) allows frame shifts to be tracked through the sequence and can extend the region of identity beyond that found with BLAST. ESTWISE can also be run as the primary search tool on an EST database. A successful EST mining approach that led to the identification of five novel rhodopsin family GPCRs was recently reported (Wittenberger et al. 2001).
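The partner check by clone identifier can be sketched as a simple grouping step. This Python example assumes the EST annotations have already been reduced to (accession, clone identifier, read end) tuples; that record layout, the function names and the example identifiers are all hypothetical and do not reflect the format of any particular database.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (accession, clone_id, read_end)
EstRecord = Tuple[str, str, str]


def group_by_clone(ests: Iterable[EstRecord]) -> Dict[str, Dict[str, List[str]]]:
    """Map each clone identifier to its accessions, split by read end."""
    clones: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for accession, clone_id, read_end in ests:
        clones[clone_id][read_end].append(accession)
    # Convert to plain dicts so later lookups do not create empty entries.
    return {clone: dict(ends) for clone, ends in clones.items()}


def partner_ests(ests: Iterable[EstRecord], accession: str) -> List[str]:
    """Return the other accessions derived from the same clone as `accession`."""
    records = list(ests)
    clone_of = {acc: clone for acc, clone, _ in records}
    if accession not in clone_of:
        return []
    return [acc for acc, clone, _ in records
            if clone == clone_of[accession] and acc != accession]


# Made-up identifiers for illustration only.
EXAMPLE = [
    ("EST0001", "CLONE_17", "5'"),
    ("EST0002", "CLONE_17", "3'"),
    ("EST0003", "CLONE_42", "5'"),
]

if __name__ == "__main__":
    print(partner_ests(EXAMPLE, "EST0001"))
```

A clone with reads from both ends gives two anchor points on the same cDNA, which is what makes the 5'-3' linkage useful when extending a partial match.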
