Library design in product space

The design of a combinatorial library experiment involves identifying pools of available reactants, for example, by searching in-house databases and the Available Chemicals Directory [22]. Experience shows that potential reactants for most combinatorial syntheses are readily available in numbers that greatly exceed the capacity of current screening procedures. Some reactants can be eliminated from further consideration through the use of filtering techniques that remove those with undesirable characteristics, using approaches analogous to those discussed in the previous section. However, the number of reactants that remain often still exceeds capacity, and hence it is necessary to apply selection techniques in order to reduce the reactant pools to manageable sizes. In most approaches to library design, subset selection is applied at the reactant level and a product library is then synthesised from the chosen reactants. However, although this approach is computationally appealing, a limitation is that optimising the reactants, for example, according to structural diversity, does not imply an optimised set of products [23].

Figure 2. The number of WDI compounds found over intervals of the ranked list. In this simulated screening experiment 50% of the drug-like compounds are found by screening 11% of the compounds. The black horizontal line shows the rate at which the WDI compounds would be found if they were distributed at random throughout the list.

Figure 3. The results of applying the weights predictively are shown superimposed on the results found for the training compounds. The training set consists of 1 000 WDI compounds and 16 661 SPRESI compounds. The predicted sets consist of 10 000 WDI compounds and 166 610 SPRESI compounds. The dashed line represents the results that would be expected if the WDI compounds were distributed at random throughout the ranked list. The x-axis shows percentage intervals in the ranked list.

An alternative approach involves enumerating the full virtual combinatorial library and performing the selection at the product level. Any of the methods developed for reactant-based selection can be applied at the product level in a process known as 'cherry-picking'. However, although this approach can lead to a subset of products that are optimally diverse, it is synthetically inefficient when mapped to a combinatorial library experiment since no account is taken of the combinatorial constraint. That is, there is no guarantee that each reactant from one pool occurs in a product with each reactant from a second pool. For example, we conducted an experiment to cherry-pick 1 600 diverse products from a 400 × 400 virtual amide library and found that maximising the structural diversity required no fewer than 137 different amines and 146 different carboxylic acids [23]. The systematic joining of all these amines to all of the carboxylic acids, as performed in practical combinatorial synthesis, would result in 19 992 molecules, of which the 1 600 most diverse molecules are a subset. The synthetic inefficiency of performing selection at the product level by cherry-picking has also been noted by Cribbs et al. [24]. In their work, nearly all of the reactants were required in order to build the selected molecules.
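The combinatorial constraint can be made concrete with a short sketch. The helpers below are hypothetical (they are not part of any program cited here): they count the distinct reactants that a set of selected products draws on, and the size of the full combinatorial library those reactants imply. The product lists are toy data invented purely for illustration.

```python
def implied_library(products):
    """Return (n_amines, n_acids, full_size) for a set of selected products.

    `products` is an iterable of (amine_id, acid_id) pairs. `full_size` is the
    number of molecules made if every selected amine is coupled with every
    selected acid, as in a practical combinatorial synthesis.
    """
    amines = {amine for amine, _ in products}
    acids = {acid for _, acid in products}
    return len(amines), len(acids), len(amines) * len(acids)


def synthetic_efficiency(products):
    """Fraction of the implied full library that was actually selected."""
    wanted = set(products)
    _, _, full_size = implied_library(wanted)
    return len(wanted) / full_size


# Toy data (hypothetical): four cherry-picked products versus a 2 x 2 array.
cherry_picked = [(0, 0), (1, 1), (2, 2), (3, 0)]
combinatorial = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(implied_library(cherry_picked), synthetic_efficiency(cherry_picked))
print(implied_library(combinatorial), synthetic_efficiency(combinatorial))
```

Under this measure a true combinatorial subset always has an efficiency of 1, whereas a cherry-picked set generally falls well below it.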

We have developed a program called SELECT [23,25] that performs product-based selection taking direct account of the combinatorial constraint. That is, SELECT can be used to design combinatorial subsets that are by definition synthetically efficient and that are optimised with respect to diversity and other user-defined properties. The diversity of libraries designed using SELECT can be measured using different descriptors, for example, Daylight [14] and UNITY [26] fingerprints and Molconn-Z parameters [27], and different diversity metrics, for example, the sum-of-pairwise dissimilarities and average nearest neighbour distance. SELECT [25] is based on a genetic algorithm and uses a multi-objective fitness function that allows many properties to be optimised simultaneously with diversity. Thus, the physicochemical property profiles of libraries can be optimised in the design of diverse and 'drug-like' libraries. SELECT can also be used to design libraries that complement existing libraries and to explore different library configurations.

Figure 4. The distribution of scores for antibiotics, in black, is shown superimposed on the distribution of scores in a subset of the SPRESI database. The y-axis represents the percentage of compounds, the x-axis represents the score.

Figure 5. The distribution of scores for antibiotics, in black, is shown superimposed on the distribution of scores in WDI, with the antibiotic compounds removed. The y-axis represents the percentage of compounds, the x-axis represents the score.

In a previous study [23] we investigated the effectiveness of product-based selection of a combinatorial library relative to reactant-based selection. Our experiments considered selecting a range of different sized subsets from three different combinatorial libraries using Daylight fingerprints as descriptors and a single diversity metric, the sum-of-pairwise dissimilarities using the cosine coefficient. We used dissimilarity-based compound selection (DBCS) [28] to select diverse reactants that were then enumerated into a product library and the diversity of the library was measured using the sum-of-pairwise dissimilarities. Diverse combinatorial subsets were then selected from the full virtual libraries using SELECT and their diversities were compared with the analogous libraries selected by analysing reactant space. Our experiments demonstrated that choosing reactants through an analysis of product space results in significantly more diverse libraries than if the selection is made at the reactant level, using the chosen combination of descriptor (Daylight fingerprints) and diversity metric (the sum-of-pairwise dissimilarities). In fact, we found that the product-based combinatorial libraries were intermediate in diversity between reactant-based selection and cherry picking in product space; however, they have the considerable advantage over cherry picking in that the subset libraries are themselves combinatorial libraries and hence amenable to efficient synthesis.
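For readers unfamiliar with dissimilarity-based compound selection, the following is a minimal sketch of one common variant, greedy MaxMin selection. The text does not say which DBCS variant was used, so the choice of MaxMin, the Tanimoto distance, the function names and the random fingerprints are all assumptions made for illustration.

```python
import numpy as np


def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two boolean fingerprint arrays."""
    union = np.logical_or(a, b).sum()
    similarity = np.logical_and(a, b).sum() / union if union else 1.0
    return 1.0 - similarity


def maxmin_select(fps, k, seed_index=0):
    """Greedily pick k indices, each time taking the compound whose nearest
    already-selected neighbour is furthest away (requires k <= len(fps))."""
    selected = [seed_index]
    # Distance from every compound to its nearest selected compound so far.
    min_dist = np.array([tanimoto_distance(fp, fps[seed_index]) for fp in fps])
    while len(selected) < k:
        min_dist[selected] = -1.0  # selected compounds can never be re-picked
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        new_dist = np.array([tanimoto_distance(fp, fps[nxt]) for fp in fps])
        min_dist = np.minimum(min_dist, new_dist)
    return selected


# Hypothetical data: 20 random 64-bit fingerprints.
rng = np.random.default_rng(1)
fingerprints = list(rng.random((20, 64)) < 0.3)
chosen = maxmin_select(fingerprints, 5)
print(chosen)
```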

We report here a more extensive series of experiments to determine the effectiveness of product-based selection relative to reactant-based selection for a number of different descriptors and diversity metrics. Specifically, we have investigated the effectiveness of product-based combinatorial library design considering three different descriptors, three different diversity metrics and two different libraries. All calculations were made using the SELECT program. The descriptors are 1024-bit Daylight fingerprints [14], 992-bit UNITY fingerprints [26] and 538 Molconn-Z parameters [27]. The Molconn-Z parameters are real numbers that have been standardised to fall in the range 0 to 1. The diversity metrics are the sum-of-pairwise dissimilarities calculated using the cosine coefficient (and implemented using the O(N) centroid algorithm [29]), SUMcos; the sum-of-pairwise dissimilarities using the Tanimoto coefficient, SUMtan; and the average nearest neighbour distance using the Tanimoto coefficient, NN.
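The reason a cosine-based sum of pairwise dissimilarities can be computed in O(N) time is that the sum of all pairwise cosine similarities equals the squared length of the sum of the unit-normalised descriptor vectors. The sketch below illustrates this idea; it is not the implementation from [29]. The function names are hypothetical, and the convention of summing over all ordered pairs (J, K), including J = K, is an assumption.

```python
import numpy as np


def sum_cos_naive(X):
    """O(N^2) reference: sum of 1 - cos(J, K) over all ordered pairs."""
    norms = np.linalg.norm(X, axis=1)
    total = 0.0
    for j in range(len(X)):
        for k in range(len(X)):
            total += 1.0 - (X[j] @ X[k]) / (norms[j] * norms[k])
    return total


def sum_cos_centroid(X):
    """O(N) in the number of molecules: normalise each row to unit length,
    add the rows up, and use sum_{J,K} cos(J, K) = |sum of unit vectors|^2."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    summed = unit.sum(axis=0)
    n = X.shape[0]
    return n * n - summed @ summed


# Hypothetical data: 30 random binary fingerprints of 128 bits.
rng = np.random.default_rng(0)
X = (rng.random((30, 128)) < 0.2).astype(float)
X[:, 0] = 1.0  # guarantee no all-zero row, so every norm is non-zero

print(sum_cos_naive(X), sum_cos_centroid(X))
```

The two functions agree to floating-point precision, but the second needs only one pass over the molecules, which matters when the metric is evaluated repeatedly inside an optimisation loop.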

SUMcos for a library of N molecules is defined as:

where COS(J, K) is the similarity between molecules J and K defined using the cosine coefficient; SUMtan for a library of N molecules is defined as:

where TAN(J, K) is the similarity between molecules J and K defined using the Tanimoto coefficient; and NN for a library of N molecules is defined as:

where 1 - TAN(J, K) is the distance between molecule J and molecule K and MIN(1 - TAN(J, K)) is the distance from molecule J to its closest neighbour.
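Written out from the verbal definitions above, the three metrics can be sketched as follows. The core terms (1 - COS, 1 - TAN and the minimum over neighbours) come directly from the text. The summation limits, the absence of any normalising factor on the two sums, the 1/N factor implied by "average", and the exclusion of K = J from the minimum are assumptions.

```latex
\[
\mathrm{SUM}_{\cos} = \sum_{J=1}^{N} \sum_{K=1}^{N} \bigl( 1 - \mathrm{COS}(J,K) \bigr)
\]
\[
\mathrm{SUM}_{\tan} = \sum_{J=1}^{N} \sum_{K=1}^{N} \bigl( 1 - \mathrm{TAN}(J,K) \bigr)
\]
\[
\mathrm{NN} = \frac{1}{N} \sum_{J=1}^{N} \min_{K \neq J} \bigl( 1 - \mathrm{TAN}(J,K) \bigr)
\]
```

In the two sums the terms with K = J vanish, because each molecule has a similarity of 1 with itself, so including or excluding them does not change the value.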

The experiments were carried out as follows. In reactant-based selection, SELECT was used to choose diverse subsets of reactants of specified sizes from each of the reactant pools independently; the selected reactants were then enumerated to form a product library, and its diversity was measured using the same metric as was used to select the reactant subsets. (SELECT can be used for reactant-based selection by setting the number of components in the library to one.) In product-based selection, SELECT was used to find an optimised combinatorial subset directly. In each experiment, the descriptors and the diversity metric to be optimised were the same for the reactant-based selection and the product-based selection, and the resulting libraries were of the same size and configuration.
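The two protocols can be compared on a toy problem with the sketch below. It does not reproduce SELECT: an exhaustive search over very small pools stands in for the genetic algorithm, the product fingerprint is taken to be the bitwise OR of the two reactant fingerprints, and the pools are random bit strings. All of these are assumptions made so that the example is self-contained.

```python
from itertools import combinations, product

import numpy as np


def tanimoto(a, b):
    """Tanimoto similarity of two boolean fingerprint arrays."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0


def sum_tan(fps):
    """Sum of 1 - TAN(J, K) over all unordered pairs J < K."""
    return sum(1.0 - tanimoto(a, b) for a, b in combinations(fps, 2))


def enumerate_products(amines, acids):
    """Toy enumeration (assumption): product fingerprint = OR of reactants."""
    return [np.logical_or(a, b) for a, b in product(amines, acids)]


def library_diversity(amines, acids, ia, ib):
    """Diversity of the combinatorial library built from index sets ia, ib."""
    return sum_tan(enumerate_products([amines[i] for i in ia],
                                      [acids[i] for i in ib]))


def best_subset(pool, size):
    """Most diverse subset of one reactant pool, judged on reactants alone."""
    return max(combinations(range(len(pool)), size),
               key=lambda idx: sum_tan([pool[i] for i in idx]))


def reactant_based(amines, acids, n_a, n_b):
    """Pick each pool independently, then enumerate and score the products."""
    ia, ib = best_subset(amines, n_a), best_subset(acids, n_b)
    return ia, ib, library_diversity(amines, acids, ia, ib)


def product_based(amines, acids, n_a, n_b):
    """Score every n_a x n_b combinatorial subset directly in product space."""
    candidates = product(combinations(range(len(amines)), n_a),
                         combinations(range(len(acids)), n_b))
    ia, ib = max(candidates,
                 key=lambda c: library_diversity(amines, acids, c[0], c[1]))
    return ia, ib, library_diversity(amines, acids, ia, ib)


# Hypothetical pools: 6 amines and 6 acids, 32-bit random fingerprints.
rng = np.random.default_rng(0)
amines = list(rng.random((6, 32)) < 0.3)
acids = list(rng.random((6, 32)) < 0.3)

_, _, div_reactant = reactant_based(amines, acids, 3, 3)
_, _, div_product = product_based(amines, acids, 3, 3)
print(div_reactant, div_product)
```

Because the product-based search considers every combinatorial subset, including the one chosen by reactant-based selection, its score can never be lower in this toy setting. A heuristic search such as a genetic algorithm carries no such guarantee, which is why the comparison has to be made empirically.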

The first library to be investigated was a two-component amide library in which amines in one pool are reacted with carboxylic acids from another to form amides. A virtual library was built using 100 amines and 100 carboxylic acids, the reactant pools each being formed by extracting structures at random from SPRESI [17]. Daylight fingerprints, UNITY fingerprints and Molconn-Z parameters were calculated for each of the reactants u
