## Clustering and Mapping Algorithms

Once the chemical structures are encoded by an appropriate descriptor set, the similarities of descriptors must be calculated. Descriptor similarities are the base for a selection of compounds according to diversity or similarity . Diversity selection techniques fall into four classes:

1. Dissimilarity-based selection.

2. Mapping-based selection.

3. Cluster-based selection.

4. Partition-based selection.

### 7.4.1 Distance Metric

The most commonly used methods for calculation of descriptor similarities are the Tanimo-to coefficient, the Euclidean distance, and the Mahalanobis distance (Fig. 12).

### 7.4.1.1 Tanimoto Coefficient

Similarity of binary molecular FP is described by calculating simple similarity coefficients , The most often used Tanimoto coefficient T(a,b) is defined as the number of bit positions set in both individual bitstrings normalized by the number of substructures in common:

T(a,b) represents the similarity between the FP of molecule a and b. N(a,b) denotes the number of common substructures, N(a) denotes the number of substructures which appears in molecule a, and N(b) denotes the number of substructures present in molecule b.

The Tanimoto coefficient is 1.0 for identical FP, and 0.0 if the molecules a and b have no substructure in common. Two molecules are marked as "similar" if the Tanimoto coefficient of their fingerprints is greater than 0.85 ,

### 7.4.1.2 Euclidean Distance

To compute Euclidean distances, any descriptor of n components is looked at as a vector in «-dimensional space. The dissimilarity is defined as the distance between the points addressed by the vectors (e.g., descriptors consisting of three values a - (alt a2, a3) for each molecule refer to points in three-dimensional space. The dissimilarity of two vectors a and b is given by the distance between the two corresponding points in space).

Dissimilarities S(a,b) between two molecules described by high-dimensional descriptors are calculated the same way:

S2(a,b) = l(a,-b,)2 i=i a, denotes the component i of the descriptor of molecule a and num the number of components of the descriptor. The distance S(a,b) can be computed for arbitrary descriptors.

7.4.1.3 Nonlinear Distance Scaling

Quality of distance measurements can be improved by nonlinear scaling (Gaussian-type or exponential functions) of the distances S(a,b).

Often, not only the similarity between two molecules is of interest, but also the similarity between one compound and a set of compounds (e.g., all pre-selected structures from a library). This problem can be addressed by summing all scaled S(a,b) values for all interesting molecule pairs .

### 7.4.1.4 Mahalanobis Distance

A more general way to describe the similarity of one molecule and a predefined set is calculation of the so-called Mahalanobis distance [77,78]. This forms an elliptical similarity region that fits the real shape of a group of molecules in the chemical space much better than the Euclidean distance.   Figure 12. Regions of a two-dimensional descriptor space defined by different similarity computation methods. The small circles show positions of a predefined set of molecules (e.g., cluster of active compounds). Similar compounds are found in the marked regions. The dissimilarities are calculated as: a) Euclidean distances to the cluster center; b) Mahalanobis distances to each molecule of the predefined set; and c) sum of all nonlinearly scaled Euclidean distances.

### 7.4.2 Dissimilarity-Based Selection

A number of different algorithms have been proposed to select a maximally diverse subset from a set of descriptors. Normally, the diverse subset is generated by selecting an initial molecule at random from the database, and then repeatedly selecting that molecule that is as different as possible from those that have already been selected [79-82].

The method can be improved by removing all compounds from the dataset that are similar to already selected ones in each iteration .  •Si r ★ ft ★ ft