## Theory and methods

In the following, we briefly summarize the theoretical background of our approach (for a detailed description, see [36]).

Noncovalent complex formation between a ligand and a protein usually takes place in aqueous solution. An implicit description of the complex solute-solvent interactions and entropic solvent effects, together with the enthalpic contributions resulting from interatomic forces (e.g. electrostatic or van der Waals) [39], is reflected by the formalism used to derive potentials of mean force from database knowledge [23,40].

### Derivation of statistical distance-dependent pair-preferences and solvent-accessible surface-dependent singlet-preferences

Following an approach at the atomic level [41,42], distance-dependent pair-potentials between ligand and protein atoms of types i and j are compiled by

$$\Delta W_{i,j}(r) = -\ln \frac{g_{i,j}(r)}{g(r)}$$

where $g_{i,j}(r)$ is the normalized radial pair-distribution function for atoms of types i and j, separated by a distance in the interval between $r$ and $r + dr$, and $g(r)$ is the normalized mean radial pair-distribution function for a distance between two atoms in the range $r$ to $r + dr$. The latter incorporates all nonspecific information common to all atom pairs present in an environment typical for proteins.
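The inverse-Boltzmann step described above can be illustrated with a minimal Python sketch. It assumes that raw contact counts are available per atom-type pair and distance bin, that counts are divided by a shell factor of 4πr² before normalization, and that the reference distribution is the unweighted mean over all pair types. These choices, the function names, and the toy counts are assumptions of the sketch, not details given in the text.

```python
import math

def normalize(counts, r_mid):
    """Turn raw contact counts into a normalized radial distribution.

    Assumption: counts are divided by the shell factor 4*pi*r^2 before
    normalization; the text only states that the distribution is normalized.
    """
    density = [n / (4.0 * math.pi * r * r) for n, r in zip(counts, r_mid)]
    total = sum(density)
    return [d / total for d in density]

def pair_potentials(counts_by_pair, r_mid):
    """Return dW[pair][k] = -ln(g_pair(r_k) / g_mean(r_k)) for each bin k."""
    g = {pair: normalize(c, r_mid) for pair, c in counts_by_pair.items()}
    # Reference state: unweighted mean distribution over all pair types.
    g_mean = [sum(gp[k] for gp in g.values()) / len(g)
              for k in range(len(r_mid))]
    dw = {}
    for pair, gp in g.items():
        # Bins without observations carry no statistics; mark them as None.
        dw[pair] = [-math.log(gp[k] / g_mean[k]) if gp[k] > 0 else None
                    for k in range(len(r_mid))]
    return dw

# Toy counts for three distance bins (not data from the paper).
r_mid = [2.5, 3.0, 3.5]
toy_counts = {("O.3", "N.3"): [10, 40, 10], ("C.3", "C.3"): [10, 10, 10]}
dw = pair_potentials(toy_counts, r_mid)
```

In the toy data, the O.3/N.3 pair is enriched in the middle bin relative to the mean distribution, so its potential is negative (favorable) there, while the C.3/C.3 pair receives a positive value in the same bin.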

The definition of an upper radius limit $r_{max}$ for interactions between atoms i and j [43] determines the overall shape of the resulting potentials. Sampling over short distances will emphasize the specific interactions formed by a ligand functional group with the neighboring binding-site residues. To guarantee that these interactions will dominate, we restrict our sampling to distances between 1 and 6 Å, with a bin size of 0.1 Å. The rationale for this upper limit arises from the fact that a 6 Å contact is short enough not to involve a water molecule as mutual mediator of a ligand-to-protein interaction. To avoid sampling over large distances while still including solvent-mediated effects, an alternative approach is required [42,44]. A combination of the short-distance sampling together with the findings from protein-fold prediction motivated us to derive a knowledge-based one-body potential scaled to the size of the solvent-accessible surface (SAS) of the protein and the ligand that becomes buried upon complex formation:

$$\Delta W_i(\mathrm{SAS}, \mathrm{SAS}_0) = -\ln \frac{g_i(\mathrm{SAS})}{g_i(\mathrm{SAS}_0)}$$

In this equation, $g_i$ is the normalized distribution function of the surface area of an atom of type i in the buried state (SAS) (considering ligand and protein individually) in comparison to the solvated state (SAS0). The surface area is calculated by an approximate cube-algorithm similar to the one introduced by Böhm [15]. In this approximation, any polar portion of the SAS that becomes buried in the complex but still faces a polar environment is considered to remain in a condition equivalent to 'solvent accessible' [45].
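The distance sampling used for the pair-potentials (1 to 6 Å in 0.1 Å bins) can be sketched as a simple histogram. The half-open bin intervals and the helper names below are assumptions of this sketch; the text specifies only the range and the bin size.

```python
R_MIN, R_MAX, BIN_WIDTH = 1.0, 6.0, 0.1      # in Å, values from the text
N_BINS = round((R_MAX - R_MIN) / BIN_WIDTH)  # number of distance bins

def bin_index(r):
    """Return the histogram bin for distance r (Å), or None if out of range."""
    if not (R_MIN <= r < R_MAX):
        return None
    # min() guards against floating-point round-up at the upper edge.
    return min(int((r - R_MIN) / BIN_WIDTH), N_BINS - 1)

def accumulate(distances):
    """Count ligand-protein atom distances per bin, ignoring those out of range."""
    counts = [0] * N_BINS
    for r in distances:
        k = bin_index(r)
        if k is not None:
            counts[k] += 1
    return counts
```

One histogram of this kind per atom-type pair would provide the raw counts from which the normalized distributions are built.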

As a first approximation, the ligand conformations found by X-ray crystallography or docking procedures are assumed to be identical to those adopted in the solvent. In this crude model, conformational changes experienced upon ligand binding [46] are not considered.

Both the short-range pair-potentials and the SAS-potentials are derived using the ReLiBase system [37] for data extraction. For our purpose, we evaluated crystallographically determined complexes with resolutions better than 2.5 Å. Complexes with covalently bound ligands or ligands with fewer than 6 or more than 50 non-hydrogen atoms were discarded. Furthermore, we excluded all complexes that were subsequently used in the validation of the predictive power of the potentials to avoid any redundancy or training effects due to overfitting. Potentials were derived for 17 different atom types using the SYBYL atom type notation: C.3, C.2, C.ar, C.cat, N.3 (= N.4), N.ar (= N.2), N.am, N.pl3, O.3, O.2, O.co2, S.3 (= S.2), P.3, F, Cl, Br, and metal atoms Met (= Ca, Zn, Ni, Fe).
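The selection criteria above amount to a simple filter over candidate complexes. The following sketch expresses them in Python; the `Complex` record and its field names are hypothetical and do not reflect the ReLiBase interface, and the entries are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Complex:
    # Hypothetical record; field names are illustrative only.
    identifier: str
    resolution: float        # in Å
    covalent_ligand: bool
    ligand_heavy_atoms: int  # non-hydrogen atoms in the ligand
    in_validation_set: bool

def usable_for_derivation(c, max_resolution=2.5, min_atoms=6, max_atoms=50):
    """Apply the criteria from the text: resolution, noncovalent ligand,
    ligand size, and no overlap with the validation set."""
    return (c.resolution < max_resolution
            and not c.covalent_ligand
            and min_atoms <= c.ligand_heavy_atoms <= max_atoms
            and not c.in_validation_set)

# Invented entries for illustration.
candidates = [
    Complex("cplx_a", 1.9, False, 24, False),  # kept
    Complex("cplx_b", 2.8, False, 24, False),  # resolution too low
    Complex("cplx_c", 2.0, True, 24, False),   # covalently bound ligand
    Complex("cplx_d", 2.0, False, 4, False),   # ligand too small
    Complex("cplx_e", 2.0, False, 24, True),   # reserved for validation
]
kept = [c.identifier for c in candidates if usable_for_derivation(c)]
```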

### Calculation of the total score for a given ligand pose

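In our approach, we assume that the total preference $\Delta W$ of a particular binding geometry can be approximated by summing over all individual contributions (i.e. of $k_i$ ligand atoms of type i and $l_j$ protein atoms of type j):

$$\Delta W = \gamma \sum_{k_i} \sum_{l_j} \Delta W_{i,j}(r) + (1 - \gamma) \left[ \sum_{k_i} \Delta W_i(\mathrm{SAS}, \mathrm{SAS}_0) + \sum_{l_j} \Delta W_j(\mathrm{SAS}, \mathrm{SAS}_0) \right]$$

Here $\Delta W_{i,j}(r)$ are the distance-dependent pair-potentials, $\Delta W_i$ and $\Delta W_j$ are the SAS-dependent singlet potentials of ligand and protein atoms, and $\gamma$ is an adjustable parameter, optimized empirically to be 0.5.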
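A minimal sketch of this summation follows, assuming the pair term is weighted by γ and the two surface terms by 1 − γ. The function and argument names are hypothetical, the individual contributions are taken as already looked up from the derived potentials, and the numbers are invented for illustration.

```python
def total_score(pair_terms, ligand_sas_terms, protein_sas_terms, gamma=0.5):
    """Combine pair and surface contributions into one score for a pose.

    pair_terms:        one value per ligand-protein atom pair within range
    ligand_sas_terms:  one value per ligand atom
    protein_sas_terms: one value per protein atom
    gamma:             weighting parameter (0.5 in the text)
    """
    pair_sum = sum(pair_terms)
    sas_sum = sum(ligand_sas_terms) + sum(protein_sas_terms)
    return gamma * pair_sum + (1.0 - gamma) * sas_sum

# Invented contributions for illustration.
score = total_score([-1.0, -2.0], [-0.5], [-0.5])
```

With γ = 0.5, the pair and surface contributions enter with equal weight.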

Our approach does not explicitly incorporate additional contributions to the binding energy such as conformational, rotational, and translational entropy. Furthermore, energy contributions arising from intramolecular interactions (van der Waals and torsion potentials) are neglected. Since popular docking tools such as FlexX, DOCK, and GOLD generate only favorable ligand conformations, we believe that these terms are of minor importance in comparison to the solute state contributions.

The obtained scoring values are taken to rank different poses of one ligand in a single protein with respect to the rms deviation from the geometry as found in the crystal structure.

### Calculation of binding affinities

As stated above, the obtained statistical preferences, and consequently the calculated scores, are considered to implicitly contain not only enthalpic but also entropic contributions to binding. Although this is not a proof, a correlation between the derived scoring values and experimentally determined binding free energies would indicate that the important contributions are correctly and sufficiently covered.

For the calculation of scoring values, cofactors and metal atoms are treated as part of the protein, whereas water molecules are omitted. The values obtained are related to the experimentally determined binding affinities using an adjustable parameter $c_s$. It is determined iteratively by scaling $\Delta W$ such that the standard deviation of the calculated values becomes equal to that of the observed ones ($pK_i$), according to a straight line with zero intercept.

The general applicability of this relationship depends on whether this adjustable parameter can be transferred among different data sets.
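The scaling condition can be illustrated as follows. The text describes an iterative determination; this sketch instead uses the closed-form ratio of standard deviations, which satisfies the same condition for a line through the origin. The sign convention (more negative scores correspond to higher affinity), the function names, and the numbers are assumptions of the sketch.

```python
from statistics import stdev

def fit_scaling(scores, pki_observed):
    """Return c_s such that the scaled scores have the same standard
    deviation as the observed pKi values."""
    return stdev(pki_observed) / stdev(scores)

def predict_pki(score, c_s):
    # Assumed sign convention: more negative scores mean tighter binding,
    # so the score is negated before scaling. Zero intercept by construction.
    return -c_s * score

# Invented values for illustration.
toy_scores = [-100.0, -200.0, -300.0]
toy_pki = [4.0, 6.0, 8.0]
c_s = fit_scaling(toy_scores, toy_pki)
predicted = [predict_pki(s, c_s) for s in toy_scores]
```

Transferability could then be examined by fitting `c_s` on one data set and applying it unchanged to another.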

### Implicit consideration of directionality in radial-symmetrical pair-potentials

The strength of interactions, especially between polar functional groups but also, for example, between aromatic rings, depends on their mutual distance and relative orientation in space. Since a single pair-potential is purely spherically symmetric, the question is how well directionality between pairs of atoms is implicitly reflected if multiple pair-potentials are considered as a composite representation.

To assess the spatial resolution of the entire ensemble of pair potentials, a cubic grid with 0.5 Å spacing is constructed in the binding pocket with a margin of 8 Å around the ligand. At every grid point not occupied by the protein, a scoring value is calculated considering all possible ligand-atom types (Equation 1). The obtained grid values are contoured for the individual ligand-atom types. The isopleths shown comprise 10% of the potential values above the global minima for each type. For a statistical evaluation, the type of a solvent-inaccessible ligand atom actually found in the analyzed crystal structure is compared to the type predicted by our function in this local area. For the analysis, the scoring values for a C.3, N.3, O.3, O.2, or O.co2 probe at the neighboring grid point are compared, and the probe with the best score is selected. We decided to use a 0.5 Å grid because the largest possible distance between a ligand atom and its nearest grid point amounts to half of the through-space diagonal (≈ 0.43 Å). We believe the grid is detailed enough, since this distance is close to the mean positional errors in experimental structure determinations.
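The grid geometry and the probe comparison can be sketched in a few lines. The sketch assumes that the margin is applied to the ligand's axis-aligned bounding box and that the best score is the lowest one; it omits the exclusion of grid points occupied by the protein. The function names and probe scores are hypothetical.

```python
import math

def grid_axes(ligand_coords, spacing=0.5, margin=8.0):
    """Return the x, y and z coordinates of a cubic grid that encloses the
    ligand's bounding box plus a margin (all lengths in Å)."""
    axes = []
    for d in range(3):
        lo = min(c[d] for c in ligand_coords) - margin
        hi = max(c[d] for c in ligand_coords) + margin
        n = int(math.floor((hi - lo) / spacing)) + 1
        axes.append([lo + k * spacing for k in range(n)])
    return axes

def max_offset(spacing):
    """Largest distance from any point to its nearest grid point:
    half of the space diagonal of one grid cell."""
    return spacing * math.sqrt(3.0) / 2.0

def best_probe(scores_by_type):
    """Pick the probe atom type with the lowest (assumed best) score."""
    return min(scores_by_type, key=scores_by_type.get)

# A single-atom 'ligand' at the origin and invented probe scores.
axes = grid_axes([(0.0, 0.0, 0.0)])
offset = max_offset(0.5)
predicted_type = best_probe({"C.3": -1.0, "N.3": -2.0, "O.2": -3.5})
```

For a 0.5 Å spacing, `max_offset` evaluates to 0.5 · √3 / 2, which is the ≈ 0.43 Å quoted in the text.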

In principle, DrugScore can calculate a scoring value at any position in space, including exactly where a ligand atom is found in a crystal structure. However, since the aim is to predict favorable ligand atom sites inside a binding pocket de novo, the grid-based approach appears most appropriate.
