Bioactivity profiling

It is evident that screening libraries will be more effective if the compounds they contain have 'drug-like' properties. However, it is difficult to define 'drug-likeness' clearly in terms of the exact characteristics that a molecule should have in order to be viable as a drug. Rather, biological activity is known to result from a complex range of characteristics such as lipophilicity, flexibility and hydrogen bond donating ability. Despite these difficulties in characterising 'drug-like' molecules, there are some general criteria that can be applied during combinatorial library design and compound acquisition programmes to filter out undesirable compounds. Examples include eliminating high molecular weight compounds, highly flexible molecules, and compounds containing reactive groups that may interfere with the intended reaction or lead to toxic or unstable products. Another example of a filtering technique is the well-known 'Rule-of-five' developed by Lipinski et al. [10]. The rule is based on easy-to-calculate properties and is designed to identify compounds that are likely to exhibit poor intestinal absorption.
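As a rough illustration of this kind of property filter, the sketch below counts rule-of-five violations from precomputed properties. The thresholds (molecular weight above 500, calculated logP above 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors) are those commonly quoted for the rule. The class and function names are hypothetical, and flagging a compound at two or more violations is one common convention, assumed here rather than taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Precomputed properties for one compound (hypothetical container)."""
    mw: float      # molecular weight
    clogp: float   # calculated logP
    hbd: int       # number of hydrogen bond donors
    hba: int       # number of hydrogen bond acceptors


def rule_of_five_violations(p: Properties) -> int:
    """Count how many of the four commonly quoted thresholds are exceeded."""
    return sum([p.mw > 500, p.clogp > 5, p.hbd > 5, p.hba > 10])


def likely_poorly_absorbed(p: Properties, max_violations: int = 1) -> bool:
    """Flag a compound with more than max_violations violations (assumed convention)."""
    return rule_of_five_violations(p) > max_violations


small = Properties(mw=350.0, clogp=2.1, hbd=2, hba=5)
large = Properties(mw=720.0, clogp=6.3, hbd=2, hba=5)
print(rule_of_five_violations(small), likely_poorly_absorbed(small))
print(rule_of_five_violations(large), likely_poorly_absorbed(large))
```

A filter of this kind gives a yes/no answer per compound, in contrast to the ranking approach described next.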

We have developed a knowledge-based approach for estimating the likelihood of a molecule exhibiting bioactivity [11]. Weights are derived that can be used to score and rank compounds so that if the compounds are screened in rank order the active molecules should be found more rapidly than if they are screened at random. Thus, the method can be used to order compounds for screening and to guide compound acquisition programmes. We characterise activity by analysing what is currently known about bioactive molecules, and we define a molecule as likely to be bioactive if it has characteristics that are similar to known bioactive molecules. A limitation of the method is that it clearly biases compounds towards those that have been shown to exhibit activity in the past; however, given that we still have much to learn about the structure-activity relationships that exist in known areas of biological activity, we believe this to be a valuable 'data mining' tool.

Table 1. SMARTS definitions used to identify the presence of hydrogen bond donors (HBD), hydrogen bond acceptors (HBA) and rotatable bonds (RB) within a molecule

Feature  SMARTS

HBA  [$([!#6;+0]);!$([F,Cl,Br,I]);!$([o,s,nX3]);!$([Nv5,Pv5,Sv4,Sv6])]

A training set, consisting of molecules in two different classes, is used to derive weights that are then used to score and rank molecules. A molecule is scored according to the extent to which its properties are typical of the molecules in the class to which it belongs. For example, weights can be derived that discriminate between 'drug-like' and 'non-drug-like' compounds, and the molecules are then ranked according to their likelihood of exhibiting activity in any therapeutic area. When choosing compounds for screening against particular therapeutic targets, weights can be derived that are specific to a given therapeutic area, for example, CNS activity.

The weights are based on easy-to-calculate physicochemical properties such as molecular weight (MW), number of rotatable bonds (RBs), number of hydrogen bond donors (HBDs), number of hydrogen bond acceptors (HBAs), number of aromatic rings (ARs), the ²κα shape index [12], and ClogP [13]. In principle, any other easy-to-calculate properties could also be used. The MW, AR and ²κα shape index features used in the experiments described below were calculated using the Daylight toolkit [14]. A hydrogen bond donor is defined as any heteroatom that carries at least one hydrogen, and a hydrogen bond acceptor is defined as a heteroatom with no positive charge, excluding the halogens, aromatic oxygen and sulphur, pyrrole nitrogen, and the higher oxidation levels of nitrogen, phosphorus and sulphur. Note that an atom can be considered as both a donor and an acceptor. The SMARTS definitions of these substructural features are given in Table 1.

The distribution of each feature in a set of compounds is represented by 20 bins. The structural features HBD, HBA, RB and AR are represented by counts and the bin size is set to one. Thus, for HBDs, the first bin represents the number of molecules in the database that have no donors, the second bin represents the number of molecules with exactly one donor, and so on, with the final bin representing the number of molecules with 19 or more donors. The physicochemical properties are also represented by bins, but in these cases the bins represent ranges of values. For example, the first bin for ²κα represents the number of molecules with ²κα values in the range 0.00-1.99, the second bin represents the number of molecules with values in the range 2.00-3.99, and so on. The bins representing the distribution of MW have a width of 75, so that the bins represent the ranges 0.00-74.99, 75.00-149.99, ... and ≥ 1425.00. Weights are then assigned to each of the bins at random and a genetic algorithm (GA) [15] is used to derive optimum weights that maximise the discrimination between two classes of compounds.
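The binning scheme can be sketched as follows. This is a minimal illustration using the bin widths stated above (1 for the count features, 2 for the shape index, 75 for MW). The names `bin_index`, `histogram` and `BIN_WIDTHS` are hypothetical, and ClogP is left out because its bin width is not given here.

```python
from typing import Iterable, List

N_BINS = 20

# Bin widths taken from the text; ClogP is omitted because its width is not stated.
BIN_WIDTHS = {"HBD": 1, "HBA": 1, "RB": 1, "AR": 1, "KAPPA2": 2.0, "MW": 75.0}


def bin_index(value: float, bin_width: float, n_bins: int = N_BINS) -> int:
    """Return the 0-based bin for a non-negative value.

    The last bin is open-ended, so it collects every value at or above
    (n_bins - 1) * bin_width.
    """
    if value < 0:
        raise ValueError("this sketch only handles non-negative feature values")
    return min(int(value // bin_width), n_bins - 1)


def histogram(values: Iterable[float], bin_width: float, n_bins: int = N_BINS) -> List[int]:
    """Count how many values fall into each bin."""
    counts = [0] * n_bins
    for v in values:
        counts[bin_index(v, bin_width, n_bins)] += 1
    return counts


print(bin_index(3, BIN_WIDTHS["HBD"]))        # three donors
print(bin_index(1500.0, BIN_WIDTHS["MW"]))    # falls in the open-ended final bin
```

With 20 bins of width 75, the final MW bin starts at 19 × 75 = 1425, which matches the ranges quoted above.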

The chromosomes of the GA are integer strings that map directly to the weights. The standard genetic operators of crossover and mutation are used to generate child chromosomes. The fitness function of the GA measures the extent to which the weights contained in a chromosome can be used to discriminate between two classes of molecules, merged within the training set. Each molecule in the training set is scored by summing weights over all the features where the weight for an individual feature is determined by the value of that feature within the molecule. The molecules are then ranked according to decreasing score and the fitness function is calculated as the average ranked position of molecules in the preferred set (for example, the set of active compounds). Thus the GA attempts to shift the distribution of scores in one class of molecules relative to the other class in order that maximum separation between the two distributions is achieved.
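The scoring and fitness calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the data layout (each molecule as a mapping from feature name to bin index), the function names, and the handling of tied scores (ties keep their input order) are all assumptions. Crossover, mutation and selection are not shown.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[int]]   # feature name -> one integer weight per bin
Molecule = Dict[str, int]        # feature name -> bin index for that molecule


def decode(chromosome: Sequence[int], features: Sequence[str], n_bins: int = 20) -> Weights:
    """Map an integer-string chromosome directly onto per-feature bin weights."""
    return {f: list(chromosome[i * n_bins:(i + 1) * n_bins]) for i, f in enumerate(features)}


def score(mol: Molecule, weights: Weights) -> int:
    """Sum, over all features, the weight of the bin the molecule falls into."""
    return sum(weights[feature][b] for feature, b in mol.items())


def fitness(mols: Sequence[Molecule], preferred: Sequence[bool], weights: Weights) -> float:
    """Average 1-based rank of the preferred molecules when sorted by decreasing score.

    Lower is better: a GA minimising this value pushes preferred molecules
    towards the top of the ranking.
    """
    order = sorted(range(len(mols)), key=lambda i: score(mols[i], weights), reverse=True)
    ranks = [pos for pos, i in enumerate(order, start=1) if preferred[i]]
    return sum(ranks) / len(ranks)


# Toy example with two features and three bins each.
weights = decode([0, 5, 1, 2, 0, 4], ["HBD", "AR"], n_bins=3)
mols = [{"HBD": 1, "AR": 2}, {"HBD": 0, "AR": 0}, {"HBD": 2, "AR": 2}, {"HBD": 2, "AR": 1}]
preferred = [True, False, True, False]
print(fitness(mols, preferred, weights))
```

In the toy example the two preferred molecules receive the two highest scores, so their average rank is as low as it can be for this set.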

The method has been applied to the discrimination of drugs and non-drugs as represented by the World Drugs Index (WDI) [16] and the SPRESI database [17], respectively. The databases were preprocessed as follows: the molecules were restricted to those that contain only the elements C, N, O, F, P, S, Cl, Br and I, and to those with molecular weight in the range 100 to 1000; in the case of salts, only the parent compounds were included; and, where possible, charges were neutralised by altering the number of hydrogens. Adjusting the charges ensures that the molecules are treated consistently with respect to pH and also allows simpler definitions of hydrogen bond donors and acceptors to be used. SPRESI was further processed by removing the compounds that occur in WDI, and then selecting a random sample of 16 661 compounds. Previous experiments showed that this subset of SPRESI is representative of the whole database [11]. It is assumed that the remaining SPRESI compounds represent inactive molecules. In practice, of course, there may well be SPRESI molecules that have not yet been identified as potentially active, but the percentage of these is assumed to be negligible. (The fact that drug companies typically screen 10 000 molecules to find a novel lead compound implies that drug activity is a rare event and therefore the chance of finding active compounds in SPRESI is low.)
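The element, molecular weight and overlap filters can be sketched as below. The record layout (`id`, `elements`, `mw`) and the function names are hypothetical. Treating hydrogen as implicitly allowed and the 100 to 1000 range as inclusive are assumptions. Salt stripping and charge neutralisation need a chemistry toolkit and are not shown.

```python
import random
from typing import Dict, Iterable, List, Set

ALLOWED_ELEMENTS = frozenset({"C", "N", "O", "F", "P", "S", "Cl", "Br", "I"})


def passes_filters(elements: Iterable[str], mw: float) -> bool:
    """Keep molecules built only from the allowed elements with MW in 100..1000.

    Hydrogen is ignored (assumed to be implicitly allowed); the range is
    treated as inclusive, which is also an assumption.
    """
    heavy = set(elements) - {"H"}
    return heavy <= ALLOWED_ELEMENTS and 100 <= mw <= 1000


def sample_inactives(records: List[Dict], drug_ids: Set[str], n: int, seed: int = 0) -> List[Dict]:
    """Drop records that also occur in the drug set, filter, then sample n at random."""
    candidates = [
        r for r in records
        if r["id"] not in drug_ids and passes_filters(r["elements"], r["mw"])
    ]
    return random.Random(seed).sample(candidates, n)


records = [
    {"id": "a", "elements": {"C", "H", "O"}, "mw": 180.2},
    {"id": "b", "elements": {"C", "H", "Si"}, "mw": 210.0},   # disallowed element
    {"id": "c", "elements": {"C", "H", "N"}, "mw": 59.1},     # too light
    {"id": "d", "elements": {"C", "H", "Cl"}, "mw": 320.5},   # also in the drug set
    {"id": "e", "elements": {"C", "H", "N", "S"}, "mw": 410.0},
]
picked = sample_inactives(records, drug_ids={"d"}, n=2)
print(sorted(r["id"] for r in picked))
```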

WDI was further processed by analysing the activity classes assigned by Derwent. Molecules were removed as follows: molecules with no activity class assigned, molecules that are labelled as 'trial-prep' and molecules that belong to the following activity classes: pesticides and plant hormones (except for fungicides), zootoxins, toxins, surfactants, diagnostics, chelators and adsorbents. It is assumed that the remainder of WDI represents a wide variety of active molecules and that it is not biased towards any particular class or classes of compound, although an inspection of its contents suggests that at least some classes, such as antimicrobials, are over-represented. We then selected a random sample of 1 000 compounds.

The features (the numbers of HBDs, HBAs, RBs and ARs, together with MW, ²κα and ClogP) were calculated for each of the molecules in SPRESI and WDI. The GA was run to minimise the average position of WDI molecules once the molecules had been scored and ranked. The distributions in Figure 1 show a clear separation between the two classes of compounds. In terms of a screening experiment, these results indicate that screening 11% of the total set of compounds would result in the extraction of 50% of the WDI compounds, as shown in Figure 2. Figure 3 shows the results of applying the weights in a predictive manner, that is, to 10 000 WDI and 166 610 SPRESI compounds. It can be seen that the weights are also effective when applied to previously unseen compounds.
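A figure such as "screening 11% of the compounds recovers 50% of the WDI compounds" can be read off a ranked list as sketched below. The function name and the toy ranking are hypothetical; the input is assumed to be a list of activity flags already sorted by decreasing score.

```python
import math
from typing import Sequence


def fraction_screened_to_recover(ranked_is_active: Sequence[bool], target_fraction: float = 0.5) -> float:
    """Fraction of the ranked list that must be screened to find the target share of actives."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    total_actives = sum(ranked_is_active)
    if total_actives == 0:
        raise ValueError("the ranked list contains no actives")
    needed = math.ceil(target_fraction * total_actives)
    found = 0
    for n_screened, is_active in enumerate(ranked_is_active, start=1):
        found += is_active
        if found >= needed:
            return n_screened / len(ranked_is_active)
    raise AssertionError("unreachable: needed never exceeds total_actives")


# Toy ranking of ten compounds, best score first; four of them are active.
ranked = [True, True, False, True, False, False, False, True, False, False]
print(fraction_screened_to_recover(ranked, 0.5))
print(fraction_screened_to_recover(ranked, 1.0))
```

For a random ordering the expected fraction screened is roughly equal to the target fraction, so values well below the target indicate enrichment.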

Extensive experiments have already been reported [11] that demonstrate that the method is even more effective when applied to discriminate specific therapeutic classes from inactive compounds: for example, the effectiveness of the method at discriminating compounds belonging to the class of antibiotics from SPRESI compounds is shown in Figure 4. The method can also be used to identify compounds in a given therapeutic class from within 'drug-like' collections. For example, Figure 5 illustrates the effect of training the GA to discriminate between compounds within the class of antibiotics and compounds in other therapeutic classes within WDI.

Figure 1. The distribution of scores for compounds in WDI, in black, is shown superimposed on the distribution of scores in a subset of the SPRESI database. The y-axis represents the percentage of compounds, the x-axis represents the score.

A similar approach to that described here has been adopted at GlaxoWellcome for the selection of a corporate screening set of compounds [18]. Similar methods have also been developed more recently by Ajay et al. [19], Sadowski [20] and Wagener and van Geerestein [21]. The methods differ in the algorithms used to discriminate between compounds (for example, neural networks and recursive partitioning are used in place of the GA), the descriptors that are used to represent the compounds, and in the data to which the methods are applied. However, broadly similar results are found in all cases. A GA has the potential advantage over neural networks that the weights that are optimised are visible (in neural networks the weights are hidden). Thus, in theory the weights produced by the GA are interpretable. However, further analysis is required in order to interpret the weights found here since a single set of weights is generated that encompasses information from a whole range of structures and since no account is taken of the co-occurrence of features within a molecule.
