Multidimensional combinatorial library optimization

Multiple criteria for the selection of an optimal sub-library

A real-world example of what combinatorial library optimization requires today is the following. Given 7262 reagents A and 1761 reagents B, these can be assembled into a virtual library of about 13 × 10^6 products A-B (one-step reaction; exact data not shown because the scaffold is proprietary). These building blocks are selected exclusively on the basis of synthesis considerations, i.e., no other criteria such as diversity are involved in this step. The synthesis robot can handle 15 by 15 reagents in one run. The task is to find an optimal 15 x 15 sub-library (225 products) within the large virtual library. The criteria are the crop protection score (see above), diversity, and the price of the starting materials. There are about 10^82 possible sub-libraries, i.e., far too many for a systematic exploration.
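The size of this search space can be checked with a few lines of Python. The reagent counts and the 15 x 15 format come from the text; treating a sub-library as an unordered choice of 15 reagents of each type is an assumption of this sketch, and the variable names are illustrative.

```python
from math import comb

N_A, N_B = 7262, 1761   # reagent counts quoted in the text
K = 15                  # reagents of each type per robot run

# Every reagent A can be combined with every reagent B.
virtual_library_size = N_A * N_B

# Choose 15 of the A reagents and, independently, 15 of the B reagents.
n_sub_libraries = comb(N_A, K) * comb(N_B, K)

# Order of magnitude = number of decimal digits minus one.
magnitude = len(str(n_sub_libraries)) - 1

print(f"virtual library: {virtual_library_size:,} products")
print(f"products per sub-library: {K * K}")
print(f"possible sub-libraries: about 10^{magnitude}")
```

Under this counting, the virtual library has roughly 13 million products and the number of sub-libraries is of the order of 10^82.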

Genetic algorithm

Gillet et al. [15] proposed solving such problems with a genetic algorithm (GA) [20]. In library design, GAs have also been applied to lead optimization [16-17], to library mixture optimization [18], and to the selection of preferable compounds from a large virtual library (i.e., 'cherry picking') [19]. Genetic algorithms optimize a population of individuals (possible solutions) by improving their 'fitness', i.e., their adaptation to the problem, through principles of natural evolution such as 'mutation' and 'crossover'. The implementation used here is based on the Genesis program [21]. The individuals in a population are different 15 x 15 sub-libraries drawn from the virtual library described above. Their fitness is the weighted sum of the percentage of compounds with a crop protection score greater than 0.3, a diversity index, and the reciprocal prices of the starting materials. The GA was run with a population size of 50, a maximum of 200 generations, a mutation rate of 0.1%, and a crossover rate of 60%. These values are close to the recommended defaults [21]. They were sufficient to reach convergence (data not shown).
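The following is a minimal sketch of such a GA, not the Genesis-based implementation used here. The population size, number of generations, and mutation and crossover rates are taken from the text. Everything else is an assumption made for illustration: the toy pool sizes, the random product scores and prices, the index-list encoding of a sub-library, binary tournament selection, elitism, the set-based crossover, and the use of the mean reciprocal price. The diversity term is left out to keep the sketch short.

```python
import random

N_A, N_B, K = 200, 100, 15        # toy pool sizes; the text has 7262 and 1761
POP_SIZE, GENERATIONS = 50, 200   # values quoted in the text
P_MUT, P_CROSS = 0.001, 0.6       # 0.1% mutation, 60% crossover

rng = random.Random(0)
# Hypothetical data: random product scores and reagent prices.
score = [[rng.random() for _ in range(N_B)] for _ in range(N_A)]
price_a = [rng.uniform(1.0, 100.0) for _ in range(N_A)]
price_b = [rng.uniform(1.0, 100.0) for _ in range(N_B)]

def fitness(ind, w_score=1.0, w_price=1.0):
    """Weighted sum of hit fraction and mean reciprocal price (no diversity term)."""
    a_idx, b_idx = ind
    hits = sum(score[a][b] > 0.3 for a in a_idx for b in b_idx)
    frac_hits = hits / (K * K)
    recip = [1.0 / price_a[a] for a in a_idx] + [1.0 / price_b[b] for b in b_idx]
    return w_score * frac_hits + w_price * sum(recip) / len(recip)

def mutate(idx, pool_size):
    """Replace each reagent with probability P_MUT by one not yet in the list."""
    out = list(idx)
    for i in range(K):
        if rng.random() < P_MUT:
            out[i] = rng.choice([j for j in range(pool_size) if j not in out])
    return out

def crossover(p1, p2):
    """Child draws K distinct reagents from the union of both parents."""
    return rng.sample(sorted(set(p1) | set(p2)), K)

def pick(scored):
    """Binary tournament: the fitter of two random individuals."""
    return max(rng.sample(scored, 2), key=lambda t: t[0])[1]

def run_ga():
    pop = [(rng.sample(range(N_A), K), rng.sample(range(N_B), K))
           for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = [(fitness(ind), ind) for ind in pop]
        best = max(scored, key=lambda t: t[0])[1]
        nxt = [best]                          # elitism: keep the current best
        while len(nxt) < POP_SIZE:
            p1, p2 = pick(scored), pick(scored)
            if rng.random() < P_CROSS:
                child = (crossover(p1[0], p2[0]), crossover(p1[1], p2[1]))
            else:
                child = p1
            nxt.append((mutate(child[0], N_A), mutate(child[1], N_B)))
        pop = nxt
    return max(pop, key=fitness)

best = run_ga()
print("best fitness:", round(fitness(best), 3))
```

With 50 individuals evaluated in each of 200 generations, the loop performs 10 000 fitness evaluations per run. Drawing each child from the union of its parents' reagents keeps every list free of duplicates without a separate repair step.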

The diversity of the sub-libraries was calculated from the products rather than from the individual building blocks. It was shown recently that such an approach is superior, particularly for diversity optimization [22]. In addition, the crop protection score cannot be calculated from the starting materials.

Figure 6. Distribution of the percentage of suitable compounds (crop protection score >0.3) in 10 000 randomly drawn 15 x 15 libraries.
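A random baseline of the kind summarized in Figure 6 could be generated along the following lines. The number of draws, the 15 x 15 format, and the 0.3 threshold come from the text; the toy pool sizes, the random product scores, and the 5-point histogram bins are assumptions, so the resulting distribution says nothing about the real library.

```python
import random

N_A, N_B, K = 200, 100, 15   # toy pool sizes; the text has 7262 and 1761
N_DRAWS = 10_000             # number of random libraries, as in Figure 6
THRESHOLD = 0.3              # crop protection score cutoff from the text

rng = random.Random(1)
# Hypothetical product scores standing in for the real crop protection score.
score = [[rng.random() for _ in range(N_B)] for _ in range(N_A)]

def percent_suitable(a_idx, b_idx):
    """Percentage of the 225 products whose score exceeds the threshold."""
    hits = sum(score[a][b] > THRESHOLD for a in a_idx for b in b_idx)
    return 100.0 * hits / (len(a_idx) * len(b_idx))

percentages = [
    percent_suitable(rng.sample(range(N_A), K), rng.sample(range(N_B), K))
    for _ in range(N_DRAWS)
]

# Histogram with 5-percentage-point bins; 100% falls into the last bin.
bins = [0] * 20
for p in percentages:
    bins[min(int(p // 5), 19)] += 1

for i, count in enumerate(bins):
    if count:
        print(f"{5 * i:3d}-{5 * i + 5:3d}%: {count}")
```

Such a histogram gives a reference against which the percentage reached by an optimized sub-library can be judged.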

The diversity index is the normalized sum of the absolute differences between the Ghose/Crippen fingerprints [4] of all pairs of compounds within a given 15 x 15 sub-library. The Ghose/Crippen fingerprints of the 225 products in a given sub-library are calculated from the fingerprints of the individual building blocks. Since these fingerprints are also the basis for the crop protection score, the computation of the fitness function is very efficient in terms of computer resources. The 10 000 fitness function calculations needed for one optimization run (50 individuals over 200 generations) take less than 30 min on an SGI R10000 processor (data not shown).
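A diversity index of this kind can be sketched as follows. Two points are assumptions, since the text does not spell them out: a product fingerprint is taken as the element-wise sum of the two building-block fingerprints, and the sum over all pairs is normalized by the number of pairs times the fingerprint length. The function names are illustrative.

```python
from itertools import combinations

def product_fingerprint(fp_a, fp_b):
    """Assumed additive rule: product counts = counts of A plus counts of B."""
    return [x + y for x, y in zip(fp_a, fp_b)]

def diversity_index(fps_a, fps_b):
    """Normalized sum of absolute fingerprint differences over all product pairs."""
    products = [product_fingerprint(a, b) for a in fps_a for b in fps_b]
    total = sum(
        sum(abs(x - y) for x, y in zip(p, q))
        for p, q in combinations(products, 2)
    )
    n_pairs = len(products) * (len(products) - 1) // 2
    # Assumed normalization: average difference per pair and per fingerprint element.
    return total / (n_pairs * len(products[0]))

# Tiny 2 x 2 example with three-element fingerprints (made-up counts).
fps_a = [[1, 0, 2], [0, 1, 2]]
fps_b = [[1, 1, 0], [3, 0, 0]]
d = diversity_index(fps_a, fps_b)
print(round(d, 3))  # 16 / 18, i.e. 0.889
```

For a 15 x 15 sub-library the pair loop covers 225 × 224 / 2 = 25 200 product pairs. Only the 30 building-block fingerprints have to be looked up, which is consistent with the low cost reported above.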
