Applied Databases and Methods

In our more recent work, two data sets were used for analysis and for calibrating the prediction method.17 First, a database of 81 representative protein alignments was created such that each alignment contained at least one sequence of known structure. The protein structures were selected from the Protein Data Bank (PDB)18 by a two-step procedure. In the first step, the sequences of all PDB proteins longer than 50 residues and with a crystallographic resolution better than 2.5 Å were compared by calculating the correlation coefficient of their dipeptide frequencies. A set of 101 proteins remained after requiring that any pair have a dipeptide

17 A. Fiser and I. Simon, Bioinformatics 47, 251 (2000).

18 E. E. Abola, F. C. Bernstein, S. H. Bryant, T. F. Koetzle, and J. Weng, Protein Data Bank, in "Crystallographic Databases—Information, Content, Software Systems, Scientific Applications" (F. H. Allen, G. Bergerhoff, and R. Sievers, eds.), p. 107. Data Commission of the International Union of Crystallography, Bonn, Germany, 1987.

frequency correlation smaller than 0.4. In the second step, every pair of proteins in the filtered set was compared by a rigorous sequence comparison method19,20 followed by cluster analysis, which yielded the final 81 proteins. The four-letter PDB codes and chain identifiers are as follows: 155C, 1ACX, 1ALC, 1BBPA, 1CC5, 1ECA, 1FKF, 1FNF, 1FNR, 1GCR, 1GPLA, 1HDSB, 1HIP, 1HOE, 1LRD4, 1PAZ, 1PCY, 1PHH, 1PRCC, 1RBP, 1RHD, 1RNH, 1SN3, 1TGS, 1TPKA, 1WSYB, 256BA, 2ALP, 2AZAA, 2CAB, 2CD4, 2CDV, 2CPP, 2FXB, 2GN5, 2LH7, 2LIV, 2LTNA, 20RLL, 2PABA, 2RNT, 2RSPA, 2SECI, 2SNLE, 2SNS, 2SODB, 2SSI, 2STV, 2TSL, 2UTGA, 3ADK, 3B5C, 3CLA, 3FXC, 3GAPB, 3LZM, 3SGB1, 451C, 4BP2, 4FDL, 4FXN, 4HHBA, 4PEP, 4PFK, 4PTP, 4TNC, 5CTS, 5CYTR, 5EBX, 5RUBA, 5RXN, 6LDH, 6TMNE, 7PTI, 8ADH, 8ATCB, 8CATA, 8DFR, 9PAP, 9RSAA, and 9WGAA. Of the protein set, 51% contained only free cysteines, 27% contained only half-cystines, 5% contained both forms of cysteine, and 15% contained neither form. By chance, the number of half-cystines (plus liganded cysteines) and the number of free cysteines in the set turned out to be equal: 148 half-cystines (plus 9 liganded cysteines) and 157 free cysteines. Each of these sequences was compared with the Protein Information Resource (PIR) database21 by the program SCANPS. Sequences that gave a probability lower than 10⁻⁶ were used to produce the multiple sequence alignments by the method of Barton and Sternberg.22 The number of sequences in an alignment varied between 3 and 499, with a median of 28. Another, less strictly selected data set of proteins (including lower resolution X-ray structures with a crystallographic R factor less than 25%) was used to confirm some observations made with the smaller set. This larger set contained 233 proteins: 161 (69.1%) had only free cysteines, 24 (10.3%) had only half-cystines, 33 (14.2%) had neither, and 15 (6.4%) contained both forms of cysteine.
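The first selection step, pairwise comparison of dipeptide frequencies with a correlation cutoff of 0.4, can be illustrated with a minimal Python sketch. Several details are assumptions and not taken from the text: the correlation is taken to be the Pearson coefficient over all 400 ordered dipeptides, dipeptides containing nonstandard residues are skipped, and the set is thinned greedily in input order. The function names are hypothetical.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides, in a fixed order shared by every frequency vector.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(sequence):
    """Return the 400 dipeptide frequencies of a one-letter protein sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # assumption: skip pairs with nonstandard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant vectors
    return cov / sqrt(var_x * var_y)


def dipeptide_correlation(seq_a, seq_b):
    """Correlation of the dipeptide frequency vectors of two sequences."""
    return pearson(dipeptide_frequencies(seq_a), dipeptide_frequencies(seq_b))


def filter_by_correlation(named_sequences, threshold=0.4):
    """Greedily keep sequences whose correlation with every kept one is below threshold.

    The greedy, order-dependent strategy is an assumption of this sketch.
    """
    kept = []
    for name, seq in named_sequences:
        if all(dipeptide_correlation(seq, other) < threshold for _, other in kept):
            kept.append((name, seq))
    return kept
```

Calling `filter_by_correlation` on a list of `(name, sequence)` pairs returns the retained subset. Because the greedy pass depends on input order, a different ordering of the same sequences can retain a different subset.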

Accessible surface areas were calculated by the program DSSP23 and converted to relative accessibilities by dividing by the accessibility of the residue in a Gly-X-Gly tripeptide.24 Two relative accessibility classes were considered: buried (A < 0.25) and exposed (A > 0.25).
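The conversion to relative accessibility and the two-class assignment amount to one division and one comparison, sketched below in Python. The reference Gly-X-Gly accessibility is passed in by the caller; no reference values are reproduced here, and the numbers in the usage lines are placeholders chosen for illustration. The text leaves the case A = 0.25 unspecified, so assigning it to "exposed" is an assumption of this sketch.

```python
def relative_accessibility(asa, reference_asa):
    """Divide an absolute accessible surface area by the Gly-X-Gly reference value."""
    if reference_asa <= 0:
        raise ValueError("reference accessibility must be positive")
    return asa / reference_asa


def accessibility_class(relative, cutoff=0.25):
    """Classify a residue as 'buried' (A < cutoff) or 'exposed' otherwise.

    Assumption: a value exactly equal to the cutoff is treated as exposed.
    """
    return "buried" if relative < cutoff else "exposed"


# Placeholder numbers for illustration only, not measured values.
rel = relative_accessibility(asa=20.0, reference_asa=100.0)
label = accessibility_class(rel)
```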

Conservation scores based on the physicochemical properties of the amino acids were calculated for each position in each alignment according to Livingstone and Barton.25 Such conservation scores range from 0 to 10 and count the number of the properties shared at a position, where the properties are as follows: Hydrophobic, Positive, Negative, Polar, Charged, Small, Tiny, Aliphatic, Aromatic, and Proline, and their negation (e.g., not Hydrophobic). For each position

19 T. F. Smith and M. S. Waterman, J. Mol. Biol. 147, 195 (1981).

21 D. G. George, W. C. Barker, and L. T. Hunt, Nucleic Acids Res. 14, 11 (1986).

22 G. J. Barton and M. J. E. Sternberg, J. Mol. Biol. 198, 327 (1987).

23 W. Kabsch and C. Sander, Biopolymers 22, 2577 (1983).

24 G. Rose, A. Geselowitz, G. Lesser, R. Lee, and M. Zehfus, Science 229, 834 (1985).

25 C. D. Livingstone and G. J. Barton, Comput. Appl. Biosci. 9, 745 (1993).

in each protein this score was then divided by the average conservation of the protein to give a relative conservation score Cr. We refer to a position as "conserved" if Cr > 1, that is, the conservation of the given position is higher than the average conservation of the sequence.
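The normalization to a relative conservation score can be sketched in a few lines of Python. The sketch assumes the per-position conservation scores (0 to 10) have already been computed for one protein; the function names are hypothetical.

```python
def relative_conservation(scores):
    """Divide each per-position score by the protein's mean score to obtain Cr."""
    if not scores:
        raise ValueError("no conservation scores given")
    mean = sum(scores) / len(scores)
    if mean == 0:
        raise ValueError("average conservation is zero; Cr is undefined")
    return [s / mean for s in scores]


def conserved_positions(scores):
    """Return the 0-based indices of positions with Cr > 1."""
    return [i for i, cr in enumerate(relative_conservation(scores)) if cr > 1]
```

For the scores 10, 5, 0, 5 the average is 5, so Cr is 2, 1, 0, 1 and only the first position counts as conserved. A position exactly at the average (Cr = 1) is not conserved under the strict inequality.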
