
Human ESTs homologous to yeast MRS4 (UniGene)

Identification of different EST clusters -> already known genes or yet unknown genes

Identification of a human genomic clone (working draft sequence) containing the EST contig

Translation of EST contig sequence

Determination of the potential human MRS4 protein

Experimental work to check the in silico cloning and to determine any parts of the sequence that were missed

Fig. 2. Schematic of the method used for human MRS4 sequence retrieval through homology with yeast MRS4 protein.

Protein query - Translated db [tblastn] in hypertext format), which compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. The amino acid sequence previously retrieved from the YPD database and kept in memory should be pasted [Ctrl-V] into the query window. It would also be possible to enter only the accession number or GI of the protein sequence. The database to be used is est_human. The search is launched by clicking on Blast!, and the result is obtained by clicking on Format! on the next screen. The result appears as a scheme showing the retrieved human ESTs as lines of different colors, each color corresponding to a range of homology scores (red for scores >200, pink for scores between 80 and 200, green for scores between 50 and 80, etc.). The human ESTs are then listed with the score and the E value of each alignment. The meaning of these variables is explained in the introductory pages of the BLAST program. The alignments of the query sequence with all translated ESTs are shown in order of decreasing homology score. Several indications help determine the likelihood that the retrieved ESTs really correspond to the human counterpart of the protein of interest (in our case, the MRS4 protein): the score, the E value, the identities, and the gaps between the query sequence and the retrieved EST (Fig. 3). Frame = +1 indicates that the EST sequence has been translated in the first 5' frame. As EST sequences are "single-pass" sequences, they are prone to contain errors. Aligning the yeast protein sequence with the translations of the human EST in all six possible reading frames allows homology to be identified despite such errors. Thus, in the chosen example, MRS4, two blocks of homology are identified in two different reading frames of the same EST, as would be expected if a sequencing error had shifted the frame. The sequences presenting the

AV704087 ADB Homo sapiens cDNA clone ADBAFE05 5'.
Length = 681

Identities = 53/114 (46%), Positives = 69/114 (60%), Gaps = 2/114 (1%)
Frame = +1

Query: 120 PMKTALSGTIATIAADALMNPFDTVKQRLQLDTNL--RVWNVTKQIYQNEGFAAFYYSYP 177
           P   + +G +AT+  DA MNP + VKQR+Q+  +   RV +  + ++QNEG  AFY SY
Sbjct: 334 PATKSAAGCVATLLHDAAMNPAEVVKQRMQMYNSPYHRVTDCVRAVWQNEGAGAFYRSYT 513

Query: 178 TTLAMNIPFAAFNFMIYESASKFFNPQNSYNPLIHCLCGGISGATCAALTTPLD 231
           T L MN+PF A +FM YE   +  NPQ  YNP  H L G  +GA  A  TTPLD
Sbjct: 514 TQLTMNVPFQAIHFMTYEFLQEHXNPQRLYNPSSHVLSGASAGAVAARATTPLD 675

Identities = 41/94 (43%), Positives = 60/94 (63%), Gaps = 1/94 (1%)
Frame = +3

Query: 13  DYEALPSHAPLHSQLLAGAFAGIMEHSLMFPIDALKTRVQAAGLNKAAS-TGMISQISKI 71
           DYEALP+ A + + ++AGA AGI+EH +M+PID +KTR+Q+   + AA    ++  + +I
Sbjct: 69  DYEALPAGATVTTHMVAGAEAGILEHCVMYPIDCVKTRMQSLQPDPAARYRNVLEALWRI 248

Query: 72  STMEGSMALWKGVQSVILGAGPAHAVYFGTYEFC 105 <- yeast MRS4
              EG     +G+     GAGPAHA+YF  YE C     <- conserved amino acids
Sbjct: 249 IRTEGLWRPMRGLNVTATGAGPAHALYFACYEKC 350 <- human retrieved EST

Fig. 3. Result of a BLAST search with the yeast MRS4 amino acid sequence. Alignment of the yeast and human sequences identified two blocks of homology corresponding to two different reading frames (amino acids 120-231 for frame +1 and amino acids 13-105 for frame +3). Dashes indicate the gaps between the query and the retrieved sequence.

lowest E values (that is, the most significant matches), with high identity and a low percentage of gaps, are obviously the most likely candidates. It should be kept in mind, however, that equally significant E values can be obtained for genes encoding proteins of similar function, the highest score not always corresponding to the gene of interest.
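The frame and percentage values reported in Fig. 3 can be checked by hand. The following is a minimal Python sketch, not part of the original procedure; the function names are ours, and it assumes a hit on the forward strand with 1-based nucleotide coordinates, with percentages truncated rather than rounded (which is what reproduces the figures shown in Fig. 3).

```python
def reading_frame(sbjct_start: int) -> int:
    """Forward reading frame (1, 2 or 3) for a 1-based nucleotide start position.

    Assumption: the hit lies on the forward strand of the EST.
    """
    return (sbjct_start - 1) % 3 + 1


def percent(count: int, length: int) -> int:
    """Percentage truncated to an integer (assumed to match the BLAST report)."""
    return 100 * count // length


# Values copied from the two alignment blocks of Fig. 3.
blocks = [
    {"query": "120-231", "sbjct_start": 334, "identities": 53, "length": 114},
    {"query": "13-105", "sbjct_start": 69, "identities": 41, "length": 94},
]

for block in blocks:
    frame = reading_frame(block["sbjct_start"])
    identity = percent(block["identities"], block["length"])
    print(f"query {block['query']}: frame +{frame}, {identity}% identity")
```

The two blocks fall in frames +1 and +3, in agreement with the legend of Fig. 3.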

A simple way to discriminate between two or more candidate genes encoding different proteins of the same family is to check the origin of each EST. This can be done with the UniGene server,7 a collection of human sequences organized into clusters, each cluster representing a unique gene with its map location and the corresponding ESTs. In our example, we search UniGene with the AV704087 EST, which presents the highest homology score (E = 7 × 10^-39) with the yeast MRS4 protein sequence. The corresponding UniGene cluster (Hs.326104) is annotated as MRS3/4 putative mitochondrial solute carrier, whereas the UniGene cluster Hs.300496 (mitochondrial solute carrier), which can be retrieved with the AI133696 EST and has a similarly high homology score (E = 1 × 10^-33), corresponds to another gene encoding a mitochondrial solute carrier. As this last EST corresponds to an already identified gene that is not the human MRS4 counterpart, AV704087 seems to be the better candidate to represent the human MRS4 counterpart. A similar search should be done for all retrieved EST sequences in order to identify their origin. In our example, of the first 20 ESTs, 12 belong to the Hs.326104 and 8 to the Hs.300496 UniGene cluster.
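Once the UniGene cluster of each retrieved EST has been looked up, the bookkeeping can be kept in a small table. The Python sketch below is only an illustration: the function names are ours, the EST-to-cluster mapping is assumed to have been filled in by hand from the UniGene pages, and only the two ESTs named in the text are listed.

```python
from collections import Counter


def tally_clusters(est_to_cluster: dict) -> Counter:
    """Count how many retrieved ESTs fall into each UniGene cluster."""
    return Counter(est_to_cluster.values())


def candidate_cluster(est_to_cluster: dict, excluded: set):
    """Return the most populated cluster that is not already assigned to
    another known gene, or None if every cluster is excluded."""
    counts = tally_clusters(est_to_cluster)
    remaining = {c: n for c, n in counts.items() if c not in excluded}
    if not remaining:
        return None
    return max(remaining, key=remaining.get)


# Mapping entered by hand after searching UniGene with each EST accession.
est_to_cluster = {
    "AV704087": "Hs.326104",
    "AI133696": "Hs.300496",
}

# Hs.300496 is excluded because it is an already identified gene that is
# not the human MRS4 counterpart.
print(candidate_cluster(est_to_cluster, excluded={"Hs.300496"}))
```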

Alignments of the ESTs that are similar to AV704087 and belong to the Hs.326104 UniGene cluster, made with the Multiple Alignment program,8 are used to confirm that they all correspond to the same gene. In our example, EST AV704087 is one of the largest ESTs (681 bp) but represents only part of the human cDNA. To try to identify the 5' and 3' parts of this sequence, AV704087 is used as a query to identify additional human ESTs. Such identification can be done with the BLAST Search program,6 using the blastn option, which compares a nucleotide query sequence against a nucleotide sequence database, the database to be used being est_human. Several additional ESTs are thus identified that share high homology with the 3' half of the AV704087 sequence. The first 15 ESTs overlap and are highly similar, some of them having a longer 3' end. Some of these ESTs are in the opposite orientation (indicated by plus/minus). In this case, the sequence can be reverse-complemented by using the Reverse Complement program from BCM Search Launcher.9 Finally, the alignment of all the sequences, using the Multiple Alignment program,8 allows construction of an EST contig. In our example, one of the ESTs (AA743110) appears to present a poly(A) tail, indicating that the complete 3' part of the cDNA should be included in this 1255-bp-long contig.
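The two operations described above, reverse-complementing an EST that is in the opposite orientation and joining overlapping ESTs, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the method used by the programs cited: the function names are ours, the overlap must match exactly (real ESTs contain errors and need a tolerant alignment), and the test sequences are made up.

```python
# Translation table for complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_by_overlap(left: str, right: str, min_overlap: int = 20):
    """Join two sequences when the end of `left` exactly matches the start
    of `right` over at least `min_overlap` bases; return None otherwise.

    The longest exact overlap is used.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


# Made-up toy sequences, for illustration only.
print(reverse_complement("ATGC"))
print(merge_by_overlap("AAACCCGGG", "CCGGGTTT", min_overlap=3))
```

An EST reported as plus/minus would first be passed through `reverse_complement` and then offered to `merge_by_overlap` in the same orientation as the others.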

7 UniGene: http://www.ncbi.nlm.nih.gov/UniGene/index.html

8 Multiple alignment: http://www.genebee.msu.su/services/malign_reduced.html

9 Reverse complement: http://dot.imgen.bcm.tmc.edu:9331/seq-util/seq-util.html

An EST contig may contain several errors, as the EST sequences are only "single-pass" cDNA sequences. These sequence errors can often be corrected by comparing the EST contig with the genomic sequences provided by the complete sequence of the human genome that is available in GenBank. Searching for a human genomic clone containing an EST contig can be performed by a BLAST search in the High-Throughput Genome Sequence (htgs) database. In our example, this identifies the RP11-85A1 clone (GenBank accession number AC007643, a working draft sequence) as containing the totality of the EST contig. In addition, this BLAST search indicates the location in the genomic clone of the different exons of the EST contig. Mismatches or gaps in the alignment of the two sequences permit restoration of the most probable sequence.
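Locating the mismatches and gaps between the EST contig and the genomic sequence can be sketched as follows. This Python fragment is only an illustration under stated assumptions: the two sequences are supplied already aligned (same length, "-" marking a gap), for example copied from one exon block of the BLAST output; the function name is ours and the example sequences are made up. Deciding which base is the most probable one at each position is left to the reader.

```python
def alignment_differences(contig_aln: str, genomic_aln: str) -> list:
    """List the columns where two pre-aligned sequences differ.

    Each entry is (column, contig_base, genomic_base, kind), with a 1-based
    column number and kind equal to "gap" or "mismatch".
    """
    if len(contig_aln) != len(genomic_aln):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    for column, (c, g) in enumerate(zip(contig_aln, genomic_aln), start=1):
        if c.upper() != g.upper():
            kind = "gap" if "-" in (c, g) else "mismatch"
            differences.append((column, c, g, kind))
    return differences


# Made-up aligned fragments, for illustration only.
for diff in alignment_differences("ACG-TNA", "ACGTTCA"):
    print(diff)
```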

The nucleotide sequence of the EST contig can now be translated into an amino acid sequence according to the three possible 5' and three possible 3' frames using the Translate Tool program.10 As, in our example, the first BLAST search with the yeast MRS4 protein indicates that the AV704087 EST is 5'-3' oriented (Fig. 3), only the three translations in the 5' frames need to be considered. One of these amino acid sequences should contain the short amino acid sequences homologous to the yeast protein that were initially identified by the BLAST search (Fig. 3). In our example, the alignment of this incomplete amino acid sequence with the yeast MRS4 protein, using the Multiple Alignment program, reveals a 45.8% homology between the two sequences, with few gaps and obvious consensus sequence blocks in several regions of the proteins, including the C-terminal part (Fig. 4). The total time required to obtain this amino acid sequence is less than 10 hr of computing. However, in our example, the 5' part of the EST contig and consequently the N-terminal part of the protein sequence are missing. At this point, experimental work needs to be started to identify the 5' part of the human MRS4 cDNA. The most efficient way would be to perform 5' RACE (rapid amplification of cDNA ends) on poly(A)+ RNA. This should result in the identification of the ATG translation initiation codon. However, this search was done in December 2000; other ESTs can probably be retrieved by the time this chapter is read, some of them possibly containing the 5' part of the cDNA.
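The translation in the three 5' and three 3' frames can be illustrated with a short Python sketch. It is not the Translate Tool program itself: the function names are ours, the standard genetic code is assumed, stop codons are written as "*", and any codon containing an ambiguous base is written as "X".

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (uppercased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq: str, offset: int = 0) -> str:
    """Translate complete codons starting at `offset` (0, 1 or 2)."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(offset, len(seq) - 2, 3)
    )


def six_frames(seq: str) -> dict:
    """Return the translations in frames +1, +2, +3 (sequence as given)
    and -1, -2, -3 (reverse complement)."""
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate(seq, k)
        frames[f"-{k + 1}"] = translate(rc, k)
    return frames


# Short made-up sequence, for illustration only.
for name, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(name, protein)
```

For a contig known to be 5'-3' oriented, as in our example, only the "+1", "+2" and "+3" entries would be inspected.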

The procedure that allowed us to identify this EST contig relies on a BLAST search in the human EST database. Another approach could be to perform this BLAST search in the nr database, which contains well-defined sequences (that is, complete gene, cDNA, or protein sequences from GenBank) but no ESTs, instead of the human EST database. This has been systematically performed for all yeast protein sequences reported in YPD.5 Indeed, the information found in YPD about MRS4 indicates, in the Related Proteins section, that the yeast MRS4 has several related proteins from different species, including human. Details of these human related proteins show that the first three human related proteins

10 Translate tool: http://www.expasy.ch/tools/dna.html
