## Methods to Infer Domain-Domain Interactions

Unfortunately, no well-established experimental method is available to detect domain-domain interactions on a large scale. In principle, one could apply common interaction detection techniques, e.g., yeast two-hybrid, to collections of engineered constructs expressing only a portion of the full-length protein: if a smaller construct retains the same ability to interact as the full-length protein, one may safely conclude that the region mediating the interaction lies within the domain(s) present in the protein fragment. This approach has been successfully employed by two different groups to explore the protein interaction networks of two micro-organisms, P. falciparum (LaCount et al. 2005) and H. pylori (Rain et al. 2001), through two-hybrid screening of a library of protein fragments. Generally speaking, however, determining the domains that mediate a protein interaction is a time-consuming task.

This prompted the development of several computational methods to identify pairs of putative interacting domains. Although many different algorithms have been devised for this purpose, so far most of them rely on the same basic assumption: if a pair of domains co-occurs in interacting protein pairs significantly more frequently than in non-interacting ones, the two domains are likely to interact (see Fig. 4). Based on this hypothesis, statistical methods may be employed to search for domain pairs recurring in interacting protein pairs. Sprinzak and Margalit (2001) scored putative interacting domain pairs by computing the log-odds of the observed co-occurrence of the two domains in interacting pairs relative to the co-occurrence expected by chance. Ng et al. (2003) developed a scoring system aimed at integrating the information from protein-protein interactions, multi-protein complexes and domain fusion events. The results of their predictions were stored in an online database called InterDom (http://interdom.lit.org.sg). Nye et al. (2005) adopted a rigorous statistical approach and applied a sophisticated simulation technique to assign to each pair of domain superfamilies occurring in a generic protein interaction dataset a p-value reflecting the likelihood that they are able to interact. Deng et al. (2002) developed a Maximum Likelihood Estimation (MLE) approach, solved with an Expectation Maximization (EM) algorithm, to infer the probabilities of the domain interactions underlying a set of protein interactions; in a more recent paper (Lee et al. 2006), they extended their method by integrating interaction probabilities with information from protein fusions and Gene Ontology (Ashburner et al. 2000) functions through a Bayesian approach. Riley et al. (2005) modified the first version of the algorithm by Deng et al. (2002) and improved it by introducing the E-score, a measure reflecting the importance of a specific domain-domain interaction in explaining a set of protein-protein interactions. Jothi et al. (2006) opted for a different approach, looking at the relative degree of co-evolution of domains in interacting protein pairs: they provided evidence that pairs of domains mediating a protein interaction are more likely to co-evolve than non-interacting domain pairs.
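The co-occurrence assumption can be made concrete with a small Python sketch. It is not the implementation used in any of the cited works: the function name `domain_pair_scores`, the input layout and the toy data are hypothetical, self-interactions are ignored, and the log-odds normalisation (observed pair frequency among interacting protein pairs divided by the product of the single-domain frequencies among proteins) is only one plausible reading of the score attributed to Sprinzak and Margalit (2001).

```python
from collections import Counter
from itertools import combinations, product
from math import log2


def domain_pair_scores(domains, interactions):
    """Score domain pairs from a protein interaction network.

    domains: dict mapping protein id -> set of domain ids (hypothetical layout).
    interactions: iterable of (protein, protein) pairs observed to interact.
    Returns {domain_pair: (association, log_odds)}; log_odds is None for
    domain pairs never seen in an interacting protein pair.
    """
    interacting = {frozenset(pair) for pair in interactions}
    n_proteins = len(domains)

    # Fraction of proteins carrying each domain (assumed stand-in for P_i).
    domain_count = Counter(d for doms in domains.values() for d in doms)
    domain_freq = {d: c / n_proteins for d, c in domain_count.items()}

    n_total = Counter()        # protein pairs containing the domain pair
    n_interacting = Counter()  # ... that are also reported to interact
    n_interacting_protein_pairs = 0

    # Enumerate every unordered pair of distinct proteins once.
    for p, q in combinations(sorted(domains), 2):
        interacts = frozenset((p, q)) in interacting
        n_interacting_protein_pairs += interacts
        domain_pairs = {tuple(sorted(dp)) for dp in product(domains[p], domains[q])}
        for dp in domain_pairs:
            n_total[dp] += 1
            n_interacting[dp] += interacts

    scores = {}
    for dp, total in n_total.items():
        # Association score: interacting pairs / all pairs containing dp.
        association = n_interacting[dp] / total
        log_odds = None
        if n_interacting[dp] > 0:
            observed = n_interacting[dp] / n_interacting_protein_pairs
            expected = domain_freq[dp[0]] * domain_freq[dp[1]]
            log_odds = log2(observed / expected)
        scores[dp] = (association, log_odds)
    return scores


# Toy network (invented for illustration only).
toy_domains = {"A": {"d1"}, "B": {"d2"}, "C": {"d1"}, "D": {"d3"}}
toy_interactions = [("A", "B"), ("C", "B")]
toy_scores = domain_pair_scores(toy_domains, toy_interactions)
print(toy_scores[("d1", "d2")])  # (1.0, 3.0)
```

In this toy network the pair d1-d2 occurs only in interacting protein pairs, so its association score is 1.0, and it is observed eight times more often than expected from the single-domain frequencies, giving a log-odds of 3.0. Pairs that never occur in an interacting protein pair receive an association score of 0.0 and no log-odds value.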

Fig. 4 The basic idea underlying algorithms to infer domain-domain interactions. The algorithms take as input a protein interaction network along with the domain composition of the proteins present in the network. Given these data, the co-occurrence frequency of a domain pair in interacting protein pairs can be computed as the ratio between the number of interacting protein pairs containing the domain pair of interest and the total number of protein pairs containing it. The frequency of each domain in the set of proteins appearing in the network can also be calculated. These frequencies can then be combined to assign each domain pair an interaction probability. Panel (b) shows two different methods to rank domain pairs according to the likelihood that they interact: the fairly simple association score and the log-ratio score devised by Sprinzak and Margalit. The first method simply ranks domain pairs by their co-occurrence frequency in interacting protein pairs, whereas the second relates this value to the co-occurrence frequency that would be expected by chance (P_i represents the frequency of domain i in the proteome). Several algorithms of varying complexity have extended this basic procedure to improve the accuracy and reliability of the predicted domain-domain interactions.

Although some of the aforementioned methods show promising results, all of them are far from perfect. Since all methods invariably require a protein interaction dataset as input, their performance depends strictly on the quality of the interaction data, which are often affected by high false-positive and false-negative rates. Another major issue is validation: how can the reliability of predictions be assessed? An accurate estimate of an algorithm's performance requires comparing its predictions with a reference set of trusted positives and negatives. The definition of such a reference set is greatly hampered by the scarcity of known interacting and non-interacting domain pairs. It is common practice to consider as true interacting domain pairs those that have been observed to interact in at least one solved three-dimensional structure, and as true non-interacting pairs those that never come into contact in known structures.
Given the small number of non-redundant three-dimensional structures contained in the PDB (Protein Data Bank, Berman et al. 2000), many true interacting domain pairs may not be represented at all in the positive set, whereas the number of non-interacting pairs in the negative set is likely to be overestimated. The accuracy of the reference set should also be questioned: in many cases it is hard to tell whether two residues come close because of a biologically meaningful interaction or merely because of crystal packing. For this reason, Shoemaker et al. (2006) have recently developed a strategy, based on structural criteria, to discriminate biologically relevant protein domain interactions from artifactual ones.
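The validation scheme discussed above amounts to simple set arithmetic, sketched here in Python. The function name `evaluate_predictions`, its arguments and the example pairs are hypothetical and not taken from any of the cited studies. The sketch assumes that the positive and negative reference sets are disjoint and that predictions falling outside both sets are left unscored.

```python
def evaluate_predictions(predicted, positives, negatives):
    """Compare predicted domain pairs with a structure-derived reference set.

    predicted: domain pairs called interacting by some method.
    positives: pairs seen in contact in at least one solved structure.
    negatives: pairs that never come into contact in known structures.
    Pairs are unordered, so each is normalised to a sorted tuple.
    """
    def normalise(pairs):
        return {tuple(sorted(pair)) for pair in pairs}

    predicted = normalise(predicted)
    positives = normalise(positives)
    negatives = normalise(negatives)

    tp = len(predicted & positives)   # predicted and structurally confirmed
    fp = len(predicted & negatives)   # predicted but never in contact
    fn = len(positives - predicted)   # confirmed but missed
    tn = len(negatives - predicted)   # correctly not predicted

    def ratio(numerator, denominator):
        # Undefined ratios are reported as None rather than raising.
        return numerator / denominator if denominator else None

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        # Predictions that the reference set cannot confirm or refute.
        "unscored": len(predicted - positives - negatives),
    }


# Invented example pairs, for illustration only.
report = evaluate_predictions(
    predicted=[("d1", "d2"), ("d3", "d2"), ("d4", "d5")],
    positives=[("d2", "d1"), ("d6", "d7")],
    negatives=[("d2", "d3"), ("d8", "d9")],
)
print(report["sensitivity"], report["specificity"], report["unscored"])  # 0.5 0.5 1
```

Because the reference set is small and the negative set is likely to be overestimated, figures computed this way are better read as rough indicators than as true error rates. The count of unscored predictions shows how much of the output the reference set cannot assess at all.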
