Knowledge from reaction databases

The advent of reaction databases has had a profound impact on the planning of reactions in the laboratory. Detailed information on a multitude of individual chemical reactions can be brought into the laboratory at the touch of a finger. With the growing size of these reaction databases (the largest among them contain several million reactions), it has become increasingly interesting to extract knowledge on chemical reactions from these databases, that is, to learn rules on reaction types from a series of individual reactions. Such an inductive learning process has been the cornerstone of accumulating knowledge on chemical reactions from the very beginning: chemists have ordered individual reactions according to their common features, have thus defined reaction types, and have made inferences on the outcome of a chemical reaction through reasoning by analogy.

It is attractive to perform such inductive learning electronically now that reaction information has become available in electronic form. Neural networks are powerful inductive learning procedures with a broad range of applications in chemistry and drug design [30]. For several years we have been using machine learning and unsupervised neural network methods for the classification of reactions into reaction types [27,28,31].

Such a classification can answer important questions met in combinatorial chemistry:

• Will my selected reaction proceed in the desired direction?

• Do I have a reaction with a broad scope?

• Are my selected reactions diverse enough?

Figure 9. A reaction is a point in a multidimensional space spanned by various physicochemical effects (ΔHf: heat of reaction; R: resonance effect; q: charge distribution; χ: inductive effect). Such a space is projected into two dimensions by a Kohonen network.


Figure 10. Similarity of chemical reactions: different directions for different types of similarities and different distances for different degrees of similarities.

The investigation of a set of reactions first requires an appropriate representation of chemical reactions. We have initially concentrated on studying the influence of the structure of the starting materials on the course of a chemical reaction, leaving aside, for the time being, the influence of reaction conditions. The course of a chemical reaction is largely governed by the physicochemical effects exerted on the reaction site, that is, on the bonds being broken and made during the reaction. These different physicochemical effects can be considered as coordinates of a space; a chemical reaction is then a point in such a multidimensional space. We use a self-organizing neural network as introduced by Kohonen [32] to project such a multidimensional space into a discrete two-dimensional space (Figure 9) [27].

A two-dimensional space is particularly suited for the visualization of similarities between reactions (Figure 10). Different directions can represent different types of relationships: to the north lie different reaction types than to the south-east. And different distances represent different degrees of similarity: the closer two reactions are, the more similar they are.

Figure 11. Architecture of a Kohonen network. An input pattern (vector) X consists of m elements. Each neuron of the network is represented by a column with m weights, wjk.

A brief introduction to the self-organizing neural network developed by Kohonen seems warranted for an understanding of the investigations reported here. Figure 11 shows the architecture of a Kohonen network in the form of a two-dimensional arrangement of neurons. Each neuron, drawn as a vertical column, contains as many weights, wji, as there are input data (descriptors), xi, for each object to be projected into the network.

In our application, an object is a chemical reaction characterized by physicochemical descriptors at the reaction site (see the following example). A reaction is assigned to the neuron whose weights are most similar to the descriptors of the input object. A competitive learning algorithm adjusts the weights of the neurons of the network to the input data. Reactions described by similar physicochemical effects are thus placed into the same or adjacent neurons. A Kohonen network can therefore be used for similarity perception and for clustering a dataset of reactions into reaction types [27,28].
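The competitive learning step can be illustrated with a minimal sketch. It is not the implementation used in the work cited here: the function names `winner` and `train_kohonen`, the flat (non-toroidal) grid, the Gaussian neighborhood, the linearly decaying learning rate and radius, and the assumption that descriptors are scaled to the range 0 to 1 are all choices made for this illustration only.

```python
import math
import random


def winner(weights, x):
    """Return (row, col) of the neuron whose weights are closest to x."""
    best_pos, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wk - xk) ** 2 for wk, xk in zip(w, x))
            if dist < best_dist:
                best_pos, best_dist = (r, c), dist
    return best_pos


def train_kohonen(data, rows, cols, epochs=50, rate0=0.5, seed=0):
    """Competitive learning on a rows x cols grid (assumed flat topology)."""
    rng = random.Random(seed)
    m = len(data[0])  # number of descriptors per reaction
    # One weight vector of length m per neuron, randomly initialized in [0, 1).
    weights = [[[rng.random() for _ in range(m)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2.0
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            decay = 1.0 - step / total          # goes from 1 towards 0
            rate = rate0 * decay                # assumed linear schedule
            radius = max(0.5, radius0 * decay)  # neighborhood shrinks over time
            wr, wc = winner(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - wr) ** 2 + (c - wc) ** 2
                    pull = rate * math.exp(-d2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    # Move the weights of this neuron towards the input.
                    for k in range(m):
                        w[k] += pull * (x[k] - w[k])
            step += 1
    return weights
```

After training, `winner(weights, x)` gives the neuron into which a reaction with descriptor vector `x` falls; reactions with similar descriptors end up in the same or in neighboring neurons.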

In the following study a dataset of reactions was investigated where all reactions involved the breaking of a C-O and of a C-H bond and the making of a C-C bond. These reactions were characterized by the physicochemical descriptors indicated in Table 1.

The dataset consisted of 288 reactions that were projected into a Kohonen network of size 20 x 20. Figure 12 shows the view from above (cf. Figure 11) onto the resulting network. Each neuron is marked to identify a specific reaction type that was assigned through a Bayes classification procedure [33].

Table 1. Eight physicochemical parameters used to characterize each reaction center of 288 training reactions and 266 test reactions

MSE = mesomeric stabilization energy; αi = effective atom polarizability; qtot = total charge; χπ = π electronegativity.

Figure 12. Kohonen map obtained for the classification of 288 reactions, marked with the Bayes Theorem classification method. White boxes represent empty neurons; boxes with an x denote conflict neurons. The symbols characterize different classes found by Bayes classification.

The classification of reactions into reaction types by a combination of a Kohonen network and a Bayes classification compares favorably with the classification assigned intellectually by a chemist to the mapping by a Kohonen network (Figure 13). (Note that the separation into different coherent areas is the important result. The symbols of the individual reaction types cannot directly be compared.)

Figure 13. Comparison of classification methods: on the left-hand side the Kohonen map was marked with the Bayes Theorem classification method, on the right-hand side it was marked by chemists intellectually. The different reaction types are indicated by different symbols. Note that the symbols of the two networks cannot directly be compared. Only the assignment of similar coherent areas is important.

Once a Kohonen network has been trained with a series of reaction instances, it can be used to assign a set of test reactions to reaction types. Figure 14 shows this for 266 reactions having the same reaction center as the reactions studied above.
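How a marked map can be used to assign test reactions is sketched below. Note the substitution: the study marks neurons by a Bayes classification [33] or by a chemist's judgment, whereas this sketch uses a plain majority vote over the training reactions in each neuron. The names `label_neurons` and `assign`, the tie rule for conflict neurons, and the tiny hand-set weight grid are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def winner(weights, x):
    """Return (row, col) of the neuron whose weights are closest to x."""
    best_pos, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wk - xk) ** 2 for wk, xk in zip(w, x))
            if dist < best_dist:
                best_pos, best_dist = (r, c), dist
    return best_pos


def label_neurons(weights, vectors, labels):
    """Mark each occupied neuron with a reaction type, or "x" on a tie."""
    hits = defaultdict(Counter)
    for x, lab in zip(vectors, labels):
        hits[winner(weights, x)][lab] += 1
    marks = {}
    for pos, counts in hits.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            marks[pos] = "x"  # conflict neuron: no clear majority
        else:
            marks[pos] = ranked[0][0]
    return marks  # neurons missing from the dict are empty


def assign(weights, marks, x):
    """Reaction type of a test reaction, or None if it hits an empty neuron."""
    return marks.get(winner(weights, x))


# Hypothetical 1 x 4 map with two descriptors per reaction.
weights = [[[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]]
train = [[0.05, 0.0], [0.1, 0.1], [0.95, 1.0], [0.5, 0.45], [0.55, 0.5]]
types = ["A", "A", "B", "A", "B"]
marks = label_neurons(weights, train, types)
print(assign(weights, marks, [0.9, 0.9]))
```

A test reaction that falls into an empty or a conflict neuron gets no clear assignment, which is itself useful information about how well the training set covers that region of the map.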

Figure 14. Kohonen map obtained for the classification of 266 reactions as test dataset. These reactions were projected into the trained network obtained by the classification of 288 reactions. The neurons were marked by chemists intellectually.

Kohonen networks trained in the fashion described above with a set of reactions can be used to assess the diversity of the reactions of a combinatorial chemistry experiment. Let us return for this purpose to the parallel synthesis of amides from acid chlorides and amines described in the previous section as an application of EROS to combinatorial chemistry. In this example, 15 acid chlorides were reacted with 15 amines to give 225 amides (see Figure 7). The question now is: do these reactions cover the entire space of amide formations from acid chlorides and amines? To answer this question, a Kohonen network was trained with all 214 reactions found in the Theilheimer reaction database for amide formation from acid chlorides and amines. These reactions were represented by eight physicochemical effects at the reaction site as shown in Table 2.

Table 2. Eight physicochemical parameters used to characterize each reaction center of 225 amide building reactions

χσ = σ electronegativity; χπ = π electronegativity; αi = effective atom polarizability.

In other words, the information in the Theilheimer database was taken as representing the universe of amide formations from acid chlorides and amines. If this is accepted, which is a reasonable working definition given the information presently available in electronic form, the weights of the Kohonen network (of size 15 x 15) trained with these 214 reactions store the entire scope of amide formations conceivable from these starting materials. We then send the dataset of 225 reactions obtained from the previously mentioned 15 acid chlorides and 15 amines through this network. Each of these test reactions is mapped into the neuron whose weights are most similar to the descriptors of the reaction considered. Figure 15 compares the Kohonen map of the 214 reactions from the Theilheimer database with the map obtained by sending the 225 test reactions through this network, which serves as the reference network.

Figure 15. Visualization of the reaction space: on the left-hand side the 214 reactions (obtained from the Theilheimer database) correspond to the complete reaction space, whereas on the right-hand side the 225 reactions (corresponding to the reaction subspace) cover less than 50% of the trained network.

This analysis shows that the 225 test reactions cover less than 50% of the space defined by the Theilheimer database. This suggests the need for additional reactions to be explored to cover the entire range of amide formations exemplified by the Theilheimer database.
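One way to put a number on such a comparison is sketched below. The text does not state how the coverage was calculated, so the definition used here is an assumption: the fraction of neurons occupied by the reference reactions that are also hit by at least one test reaction. The function names and the small hand-set 2 x 2 map are hypothetical.

```python
def winner(weights, x):
    """Return (row, col) of the neuron whose weights are closest to x."""
    best_pos, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wk - xk) ** 2 for wk, xk in zip(w, x))
            if dist < best_dist:
                best_pos, best_dist = (r, c), dist
    return best_pos


def occupied(weights, vectors):
    """Set of neurons hit by at least one reaction of the dataset."""
    return {winner(weights, x) for x in vectors}


def coverage(weights, reference, test):
    """Share of reference-occupied neurons also reached by the test set."""
    ref = occupied(weights, reference)
    hit = occupied(weights, test) & ref
    return len(hit) / len(ref)


# Hypothetical 2 x 2 reference map with two descriptors per reaction.
weights = [[[0.0, 0.0], [0.0, 1.0]],
           [[1.0, 0.0], [1.0, 1.0]]]
reference = [[0.1, 0.0], [0.0, 0.9], [0.9, 0.1], [1.0, 0.9]]
test = [[0.05, 0.05], [0.1, 0.1], [0.1, 0.95]]
print(coverage(weights, reference, test))
```

The neurons in `occupied(weights, reference) - occupied(weights, test)` point to the regions of the reference space that the test reactions leave unexplored.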
