[47] Expressed Sequence Tag Database Screening for Identification of Human Genes

By Agnes Rotig, Arnold Munnich, and Pierre Rustin

Introduction

Because the identification of human genes makes possible a better understanding of physiological and pathological processes, it represents one of the main goals of human geneticists. For a long time, working at the bench was the only way to reach this goal, but more recently the enormous amount of data resulting from the human genome sequence project and the development of computing capabilities have revolutionized the field. New approaches and new tools are now routinely used for gene identification. A first useful resource is the Expressed Sequence Tag (EST) database (as of January 20, 2001 there were more than 2.9 × 10^6 ESTs), generated from a large variety of human tissues and representing a large number of human genes. In addition, the availability of the complete human genome sequence will presumably boost the identification of a large number of human genes. Indeed, a huge amount of information about the human genome sequence is now available in several databases and can be readily used for the identification of human genes. Finally, the increasing number of online computing programs allows these sequence databases to be easily exploited. Nevertheless, it is worth remembering that a nucleotide sequence retrieved from a human sequence database is generally not sufficient per se to determine a cDNA or gene sequence, and that further experimental work must be carried out for each particular gene.

When possible, the quickest approach to identifying a human gene is often cross-species comparison by computer analysis. Because a large number of biochemical functions and metabolic pathways have been conserved during evolution, the corresponding protein and/or nucleotide sequences are often highly conserved, at least in part. Thus the knowledge acquired about model organisms such as the yeast Saccharomyces cerevisiae, the fly Drosophila melanogaster, or the nematode Caenorhabditis elegans is of great help in identifying genes in other species, especially in humans. Such in silico cloning, which allows the identification of an as yet unknown human gene, takes only a few hours, compared with tedious and time-consuming library screening or functional complementation. Once part of the sought sequence has been identified, experimental work is still needed to verify the in silico cloning and to reconstitute the complete sequence.
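The principle behind cross-species comparison can be illustrated with a minimal sketch. It slides a short query peptide (standing in for a conserved region of a model-organism protein) along a longer target (standing in for a translated human EST) and reports the ungapped window with the highest fraction of identical residues. This is not the algorithm used by database search programs, which rely on substitution matrices, gaps, and statistical scoring; the function name best_ungapped_match and the toy sequences are hypothetical and serve only as an illustration.

```python
def best_ungapped_match(query: str, target: str) -> tuple[int, float]:
    """Return (offset, identity) of the best ungapped window of target.

    identity is the fraction of positions at which the query and the
    window of the target carry the same residue. Ties keep the first
    (leftmost) window.
    """
    if not query or len(query) > len(target):
        raise ValueError("query must be non-empty and no longer than target")

    best_offset, best_identity = 0, -1.0
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        # Count positions where the two residues are identical.
        matches = sum(q == t for q, t in zip(query, window))
        identity = matches / len(query)
        if identity > best_identity:
            best_offset, best_identity = offset, identity
    return best_offset, best_identity


if __name__ == "__main__":
    # Toy sequences invented for illustration only (not real proteins).
    model_peptide = "MKSAY"
    translated_est = "GGMKTAYLL"
    offset, identity = best_ungapped_match(model_peptide, translated_est)
    print(f"best window starts at {offset}, identity {identity:.0%}")
```

A partially conserved region still yields a clear best window, which is why a sequence from a distant model organism can serve as a probe for a human EST; any such hit remains a candidate to be confirmed experimentally.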

We describe below the different steps involved in in silico cloning, using several databases and programs. A standard personal computer connected to the Internet is the only equipment required. This strategy is facilitated by the use of a molecular biology server such as the Deambulum server,¹ which allows rapid connection to several databases and sequence analysis programs (Fig. 1). Almost all the databases and programs mentioned below can be easily accessed through this server. Alternatively, each database can be accessed directly by entering its Web address (indicated in the footnotes) in an Internet browser.
