Figure 6. The thiazoline-2-imine library. (Reaction scheme; surviving reactant labels: R1-NCS, R2-NH2.)

Table 2. Reactant-based versus product-based diversities for amide libraries selected using Daylight fingerprints as descriptors. The column headed 'Min' gives the diversity calculated when SELECT was run to find combinatorial subsets with minimum diversity. The final column gives the percentage difference in diversity between product-based and reactant-based selection relative to the range of values possible (given by subtracting the Min diversity from the Product diversity).

| Metric | Size | Reactants | Products | Min | % Diff |
|---|---|---|---|---|---|
| SUMcos | 900 (30 x 30) | 0.565 (0.002) | 0.586 (0.002) | 0.356 | 9.4 |
| | 400 (20 x 20) | 0.567 (0.001) | 0.595 (0.001) | 0.305 | 9.6 |
| | 100 (10 x 10) | 0.560 (0.001) | 0.601 (0.001) | 0.227 | 10.8 |
| SUMtan | 900 (30 x 30) | 0.715 (0.002) | 0.744 (0.002) | 0.522 | 12.5 |
| | 400 (20 x 20) | 0.717 (0.004) | 0.750 (0.001) | 0.462 | 11.4 |
| | 100 (10 x 10) | 0.706 (0.006) | 0.747 (0.002) | 0.362 | |
| NN | 900 (30 x 30) | 0.253 (0.003) | 0.305 (0.001) | 0.045 | 20.1 |
| | 400 (20 x 20) | 0.270 (0.008) | 0.347 (0.004) | 0.034 | 24.4 |
| | 100 (10 x 10) | 0.315 (0.003) | 0.405 (0.005) | 0.019 | 27.8 |

in the reactant pools. Next, the full library of 10 000 amides was enumerated, and the descriptors were calculated for each product molecule.

The second library was a three-component library based on a thiazoline-2-imine template [30]; the reaction is shown in Figure 6. The R1 reactants are isothiocyanates, the R2 reactants are amines, and the R3 reactants are haloketones. Reactants for each pool were extracted at random from SPRESI. The pools consisted of 10 isothiocyanates, 40 amines and 25 haloketones, representing a fully enumerated virtual library of 10 000 thiazoline-2-imines. Daylight fingerprints, UNITY fingerprints and Molconn-Z parameters were calculated for each product molecule in the enumerated virtual library and for each reactant in the reactant pools.
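As a rough illustration of what full enumeration means here, the sketch below builds every combination of one reactant per pool. The reactant identifiers and the `enumerate_library` helper are hypothetical placeholders, not the actual SPRESI compounds or the authors' software; only the pool sizes come from the text.

```python
from itertools import product

# Hypothetical reactant identifiers; the real pools were drawn from SPRESI.
isothiocyanates = [f"NCS_{i}" for i in range(10)]
amines = [f"NH2_{i}" for i in range(40)]
haloketones = [f"HK_{i}" for i in range(25)]


def enumerate_library(*pools):
    """Return every combination that takes one reactant from each pool."""
    return list(product(*pools))


# Full virtual library: 10 x 40 x 25 combinations.
virtual_library = enumerate_library(isothiocyanates, amines, haloketones)

# A combinatorial subset in a 6 x 10 x 15 configuration (arbitrary picks here).
subset = enumerate_library(isothiocyanates[:6], amines[:10], haloketones[:15])

print(len(virtual_library), len(subset))
```

A combinatorial subset is always the full cross-product of the chosen reactants, which is why its size is fixed by the configuration rather than by how many individual products are wanted.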

Table 3. Reactant-based versus product-based diversities for amide libraries selected using UNITY fingerprints as descriptors. The final two columns are as described for Table 2.

| Metric | Size | Reactants | Products | Min | % Diff |
|---|---|---|---|---|---|
| SUMcos | 900 (30 x 30) | 0.552 (0.002) | 0.566 (0.002) | 0.339 | 5.9 |
| | 400 (20 x 20) | 0.569 (0.001) | 0.584 (0.003) | 0.302 | 5.2 |
| | 100 (10 x 10) | 0.576 (0.005) | 0.601 (0.001) | 0.226 | 6.7 |
| SUMtan | 900 (30 x 30) | 0.715 | 0.727 | 0.507 | 5.5 |
| | 400 (20 x 20) | 0.717 (0.003) | 0.737 (0.002) | 0.470 | 7.4 |
| | 100 (10 x 10) | 0.727 (0.004) | 0.746 (0.001) | 0.364 | 5.0 |
| NN | 900 (30 x 30) | 0.243 | 0.294 | 0.045 | 20.5 |
| | 400 (20 x 20) | 0.272 (0.009) | 0.333 (0.006) | 0.028 | 19.6 |
| | 100 (10 x 10) | 0.297 (0.005) | 0.399 (0.003) | 0.014 | 26.3 |

Table 4. Reactant-based versus product-based diversities for amide libraries selected using normalised Molconn-Z parameters as descriptors. The final two columns are as described for Table 2.

| Metric | Size | Reactants | Products | Min | % Diff |
|---|---|---|---|---|---|
| SUMcos | 900 (30 x 30) | 0.278 (0.001) | 0.288 (0.000) | 0.121 | 6.5 |
| | 400 (20 x 20) | 0.294 (0.001) | 0.308 (0.001) | 0.104 | 7.8 |
| | 100 (10 x 10) | 0.315 (0.002) | 0.332 (0.001) | 0.076 | 5.1 |
| SUMtan | 900 (30 x 30) | 0.451 | 0.470 | 0.217 | 7.5 |
| | 400 (20 x 20) | 0.474 (0.002) | 0.492 (0.001) | 0.182 | 5.8 |
| | 100 (10 x 10) | 0.488 (0.005) | 0.513 (0.001) | 0.136 | 6.9 |
| NN | 900 (30 x 30) | 0.107 | 0.150 | 0.036 | 37.7 |
| | 400 (20 x 20) | 0.128 (0.003) | 0.179 (0.003) | 0.031 | 33.1 |
| | 100 (10 x 10) | 0.147 (0.007) | 0.232 (0.002) | 0.023 | 42.9 |

Table 2 shows the results for selecting amide libraries of various sizes using reactant-based selection and product-based selection for Daylight fingerprints for the three diversity metrics, SUMcos, SUMtan and NN. Tables 3 and 4 show similar results for UNITY fingerprints and Molconn-Z parameters, respectively. In general, the results are based on average diversities and standard deviations (given in brackets) over five runs, except for some of the runs using the SUMtan and NN metrics. These metrics are O(N²) in complexity, unlike the SUMcos metric, which is O(N), and insufficient computing resources were available to allow repeated runs. In each case it can be seen that product-based selection is more effective in selecting diverse libraries than is reactant-based selection.
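The O(N) cost of a cosine-based sum-of-pairwise metric can be understood through a centroid identity: for unit-length vectors, the sum of all pairwise cosines equals the squared length of their vector sum. The sketch below checks this on toy data. The normalisation (mean over distinct pairs) and the function names are assumptions made for illustration; the exact form used inside SELECT is not given in this section.

```python
import math
from itertools import combinations


def _unit(v):
    """Scale a descriptor vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_cos_dissim_pairwise(vectors):
    """O(N^2): average (1 - cosine) over all distinct pairs."""
    units = [_unit(v) for v in vectors]
    pairs = list(combinations(units, 2))
    total = sum(1.0 - sum(a * b for a, b in zip(u, w)) for u, w in pairs)
    return total / len(pairs)


def mean_cos_dissim_centroid(vectors):
    """O(N): the same quantity obtained from the summed unit vectors."""
    units = [_unit(v) for v in vectors]
    n = len(units)
    summed = [sum(column) for column in zip(*units)]
    # Squared length of the sum = sum of cosines over all ordered pairs,
    # including the n self-pairs, each of which contributes exactly 1.
    squared = sum(x * x for x in summed)
    return 1.0 - (squared - n) / (n * (n - 1))


toy = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
slow = mean_cos_dissim_pairwise(toy)
fast = mean_cos_dissim_centroid(toy)
print(abs(slow - fast) < 1e-12)
```

No comparable shortcut is assumed here for the Tanimoto or nearest-neighbour metrics, which is consistent with the higher cost reported for them.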

We have considered a number of different ways of quantifying the differences in diversity values, since the absolute values are related to the particular descriptors and diversity metrics used. One way might be to determine the degree of overlap between the subsets generated by reactant-based selection as compared to product-based selection, on the assumption that the greater the difference between the subsets, the greater is the difference in effectiveness between the two methods. However, since both reactant-based and product-based selection are non-deterministic (the subsets are selected using a GA), different runs of the algorithm can produce different results. It has been our experience that whereas the final diversity measure does not vary greatly from one run to another, as evidenced by the low standard deviations in the tables, the exact composition of the subsets can vary. In other words, there are many different subsets that give the same near-maximal diversity, and subsets having the same diversity can have a relatively small degree of overlap. Thus the degree of overlap between sets cannot be used to quantify a difference in diversity.

Table 5. Reactant-based versus product-based diversity for a thiazoline-2-imine library using Daylight fingerprints as descriptors. The final two columns are as described for Table 2.

| Metric | Size | Reactants | Products | Min | % Diff |
|---|---|---|---|---|---|
| SUMcos | 900 (6 x 10 x 15) | 0.394 | 0.424 | 0.303 | 24.8 |
| | 400 (4 x 10 x 10) | 0.389 | 0.420 | 0.272 | 21.0 |
| | 100 (2 x 10 x 5) | 0.362 | 0.406 | 0.221 | 23.8 |
| SUMtan | 900 (6 x 10 x 15) | 0.563 | 0.594 | 0.455 | 22.3 |
| | 400 (4 x 10 x 10) | 0.552 | 0.589 | 0.424 | 22.4 |
| | 100 (2 x 10 x 5) | 0.514 | 0.574 | 0.345 | 26.2 |
| NN | 900 (6 x 10 x 15) | 0.151 | 0.204 | 0.051 | 34.6 |
| | 400 (4 x 10 x 10) | 0.167 | 0.232 | 0.042 | 34.2 |
| | 100 (2 x 10 x 5) | 0.208 | 0.289 | 0.027 | 30.1 |

The way we have chosen to quantify the difference in diversities is to calculate the percentage change in diversity based on the range of diversity values that are possible for a given subset size. When measuring the similarity, or dissimilarity, between two compounds the possible values are in the range 0 to 1; however, the values possible for a diversity metric such as the sum-of-pairwise dissimilarities fall in a much smaller range. In earlier work [23], we demonstrated that a GA is able to find near-optimally diverse subsets when operating in cherry-picking mode. If we assume here that SELECT is able to find the global maximum diversity for combinatorial subsets selected from a combinatorial library, then we can also use SELECT to find the global minimum by minimising the diversity of the subsets chosen. The columns headed Min in Tables 2 to 4 report the minimum diversities found over five runs (except for a few cases using the SUMtan and NN diversity metrics when the calculation was performed once only, see above) when product-based selection is performed to select the subset with minimum diversity. The final column then gives the difference between product-based and reactant-based selection calculated as a percentage of the possible range of values (the minimum value subtracted from the maximum value found using product-based selection).
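The percentage difference described above can be written as 100 x (Products - Reactants) / (Products - Min). The short sketch below applies it to the SUMcos, 900-member row of Table 5; the function name `percent_diff` is ours, chosen for illustration.

```python
def percent_diff(reactant_div, product_div, min_div):
    """Gain of product-based over reactant-based selection,
    as a percentage of the achievable range (product_div - min_div)."""
    span = product_div - min_div
    if span <= 0:
        raise ValueError("product diversity must exceed the minimum diversity")
    return 100.0 * (product_div - reactant_div) / span


# Table 5, SUMcos, 900 (6 x 10 x 15): Reactants 0.394, Products 0.424, Min 0.303
value = percent_diff(0.394, 0.424, 0.303)
print(round(value, 1))  # 24.8
```

Because the tabulated diversities are rounded to three decimal places, recomputing from them need not reproduce every published percentage exactly.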

The results for the thiazoline-2-imine library using Daylight fingerprints and the three diversity metrics are shown in Table 5. The minimum diversity possible and the percentage differences in diversity for product-based selection versus reactant-based selection were calculated as for the amide libraries. The results over all descriptors and all diversity metrics for both libraries are summarised in Table 6 for selecting 900-member subset libraries of configuration 30 x 30 for the amide libraries and 6 x 10 x 15 for the thiazoline-2-imine libraries (6 isothiocyanates, 10 amines and 15 haloketones). It can be seen that product-based selection is more effective in all cases. The effect is more pronounced over all the descriptors and metrics for the three-component thiazoline-2-imine library. It is an intuitive result that product-based selection should increase in effectiveness as the number of components in a library increases: reactant-based selection takes no account of the relationship between reactants selected from different reactant pools, and the greater the number of pools the more of a limitation this is likely to become.

Table 6. Percentage differences in reactant-based selection versus product-based selection over the three descriptor types and the three metrics for the amide and thiazoline-2-imine libraries. The libraries contain 900 products in configuration 30 x 30 for the amide library and configuration 6 x 10 x 15 for the thiazoline-2-imine library.

| Diversity metric | Descriptor | Amide % Diff | Thiazoline % Diff |
|---|---|---|---|
| SUMcos | Daylight | 9.4 | 24.8 |
| | UNITY | 5.9 | 12.9 |
| | Molconn-Z | 6.5 | 12.6 |
| SUMtan | Daylight | 12.5 | 22.3 |
| | UNITY | 5.5 | 8.0 |
| | Molconn-Z | 7.5 | 11.4 |
| NN | Daylight | 20.1 | 34.6 |
| | UNITY | 20.5 | 35.6 |
| | Molconn-Z | 37.7 | 49.2 |

The advantage of product-based selection over reactant-based selection using SUMcos or SUMtan as the diversity metric is more pronounced for Daylight fingerprints than for UNITY fingerprints or Molconn-Z parameters. Daylight fingerprints include large structural fragments (containing up to 7 atoms) and hence the product molecules are likely to be represented by fragments that span reactants that originate in different pools; thus there will be more structural information encoded in the product molecules for use in the diversity calculation. This is especially the case for the three-component library. Therefore, it is not surprising that better results can be achieved by performing the analysis in product space. UNITY fingerprints consist of bits that are derived from paths within a molecule in addition to some structural keys that record the presence or absence of particular fragments. The structural keys tend to be more localised than the path-based fragments and hence there will be fewer bits that arise in the product molecules only. The Molconn-Z parameters encompass a huge range of types of molecular descriptor, and it is thus more difficult to explain the precise magnitudes of the diversities resulting from their use.

The difference between product-based and reactant-based selection is most marked for the NN diversity metric. Choosing diverse reactants (using any metric) and then enumerating the products will result in clusters of closely related molecules in product space (since in a two-component library a given reactant from one pool will exist in product molecules with all the reactants from the other pool). The SUMcos and SUMtan diversity metrics are based on calculating the sum-of-pairwise dissimilarities between compounds, and diverse sets of compounds found using these measures can contain compounds that are close in descriptor space [9]; thus the existence of clusters of compounds can still result in high diversities using these metrics. The NN metric, however, favours an even distribution of compounds in descriptor space, and the occurrence of clusters in product space is likely to result in a relatively poor NN diversity score. Thus, maximising the NN metric directly in product space is likely to produce a better spread of molecules throughout the space than can be achieved by considering the reactants alone.
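The contrast between the two kinds of metric can be seen on a toy example. The sketch below places four points on a line and uses absolute difference as the distance; both choices are simplifying assumptions, as is treating NN as the mean nearest-neighbour distance, and none of this reproduces the fingerprint-based calculations reported in the tables.

```python
from itertools import combinations


def mean_pairwise(points):
    """Mean distance over all distinct pairs (sum-of-pairwise style score)."""
    pairs = list(combinations(points, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def mean_nearest_neighbour(points):
    """Mean distance from each point to its closest other point (NN style score)."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(abs(p - q) for j, q in enumerate(points) if j != i)
    return total / len(points)


clustered = [0.0, 0.01, 1.0, 1.01]  # two tight clusters far apart
even = [0.0, 0.33, 0.67, 1.0]       # roughly evenly spread

for name, pts in (("clustered", clustered), ("even", even)):
    print(name, round(mean_pairwise(pts), 3), round(mean_nearest_neighbour(pts), 3))
```

In this toy case the clustered set scores higher on the pairwise measure but much lower on the nearest-neighbour measure, which mirrors the argument that clusters in product space are penalised mainly by NN.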

Similar studies have been performed recently by Jamois [31] and Pearlman [32]. Jamois has reported the difference in product-based selection relative to reactant-based selection as a percentage of the representativity of the subset relative to the entire virtual library. Pearlman has developed a method for library design that is called reactant-biased, product-based selection and compared the diversity of libraries designed by this method with cherry-picking in product space and reactant-based selection. Again the differences in diversity are reported as percentages. It is not possible to make a quantitative comparison between the approaches of Jamois and Pearlman and that described here since different diversity metrics and descriptors have been used. However, the results of both Pearlman and Jamois support our general findings in that they both conclude that product-based selection results in significantly more diverse libraries than does reactant-based selection.
