Similarity Search

The notion of molecular similarity (or chemical similarity) is one of the most useful and at the same time one of the most contradictory concepts in chemoinformatics.247,248 The concept of molecular similarity plays an important role in many modern approaches to predicting the properties of chemical compounds, designing chemicals with a predefined set of properties and, especially, conducting drug design studies by screening large databases containing structures of available (or potentially available) chemicals. These studies are based on the similar property principle of Johnson and Maggiora, which states: similar compounds have similar properties.247 Similarity-based virtual screening assumes that all compounds in a database that are similar to a query compound have similar biological activity. Although this hypothesis is not always valid (see discussion in ref. 249), quite often the set of retrieved compounds is considerably enriched with actives.250

To achieve high efficiency in similarity-based screening of databases containing millions of compounds, molecular structures are usually represented by screens (structural keys) or by fixed-size or variable-size fingerprints. Screens and fingerprints can contain both 2D and 3D information. However, 2D fingerprints, which are a kind of binary fragment descriptor, dominate in this area. Fragment-based structural keys, like MDL keys,62 are sufficiently good for handling small and medium-sized chemical databases, whereas processing of large databases is performed with fingerprints having much higher information density. Fragment-based Daylight,251 BCI,252 and UNITY 2D253 fingerprints are the best known examples.
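The difference between a structural key and a hashed fingerprint can be illustrated with a minimal sketch. It assumes that fragments have already been enumerated and are given as plain strings; the function names, the CRC32 hash and the bit-vector length are illustrative assumptions and do not reproduce the MDL, Daylight, BCI or UNITY implementations.

```python
import zlib


def structural_key(fragments, dictionary):
    """Structural key: one bit per entry of a predefined fragment dictionary.

    Bit i is 'on' if dictionary[i] occurs among the molecule's fragments.
    """
    present = set(fragments)
    return {i for i, frag in enumerate(dictionary) if frag in present}


def hashed_fingerprint(fragments, n_bits=1024):
    """Hashed fingerprint: every fragment is folded onto one of n_bits positions.

    No dictionary is needed, but distinct fragments may collide on one bit.
    CRC32 is used only because it is deterministic across runs (an assumption
    of this sketch, not the hash used by any commercial fingerprint).
    """
    return {zlib.crc32(frag.encode("utf-8")) % n_bits for frag in fragments}


# Hypothetical fragment strings for one molecule
frags = ["C-C", "C-O", "C=O"]
key = structural_key(frags, dictionary=["C-N", "C-O", "C=O"])
fp = hashed_fingerprint(frags, n_bits=64)
```

Both functions return the set of "on" bit positions, which is a convenient representation for the set-based similarity coefficients discussed in this section.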

The most popular similarity measure for comparing chemical structures represented by means of fingerprints is the Tanimoto (or Jaccard) coefficient T.254 Two structures are usually considered similar if T > 0.85 (ref. 250) (for Daylight fingerprints251). Using this threshold, Taylor estimated the probability of retrieving actives as 0.012-0.50 (ref. 255), whereas according to Delaney this probability is even higher, i.e., 0.40-0.60 (ref. 256) (using Daylight fingerprints251). These computer experiments confirm the usefulness of the similarity approach as an instrument of virtual screening.
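For binary fingerprints the Tanimoto coefficient is T = c / (a + b - c), where a and b are the numbers of bits set in the two fingerprints and c is the number of bits set in both. A minimal sketch follows; fingerprints are represented as sets of "on" bit positions, and returning 0.0 for two empty fingerprints is a convention assumed here. The toy bit sets are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    # Convention assumed in this sketch: two empty fingerprints score 0.0
    return common / union if union else 0.0


def similar_to_query(query, database, threshold=0.85):
    """Names of database entries whose Tanimoto coefficient exceeds the threshold."""
    return [name for name, fp in database.items() if tanimoto(query, fp) > threshold]


# Hypothetical toy data
query = {1, 2, 3, 4, 5, 6, 7}
database = {
    "mol_1": {1, 2, 3, 4, 5, 6, 7, 8},  # T = 7/8 = 0.875
    "mol_2": {1, 2, 3, 9},              # T = 3/8 = 0.375
}
hits = similar_to_query(query, database)
```

With the 0.85 threshold quoted above, only `mol_1` is retained in this toy example.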

Schneider et al. have developed a special technique for performing virtual screening referred to as Chemically Advanced Template Search (CATS).257 Within its framework, chemical structures are described by means of so-called correlation vectors, each component of which is equal to the number of occurrences of a given atom pair divided by the total number of non-hydrogen atoms in the molecule. Each atom in the atom pair is specified as belonging to one of five classes (hydrogen-bond donor, hydrogen-bond acceptor, positively charged, negatively charged, and lipophilic), while topological distances of up to ten bonds are also considered in the atom-pair specification. In ref. 257, the similarity is assessed by the Euclidean distance between the corresponding correlation vectors. CATS has been shown to outperform the MERLIN program with Daylight fingerprints251 for retrieving thrombin inhibitors in a virtual screening experiment.257
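The construction of such a correlation vector can be sketched as follows. This is a simplified illustration rather than the CATS implementation: each heavy atom is assumed to carry at most one of the five class labels (`None` if untyped), the typing itself is taken as given, distances of 1 to 10 bonds are binned, and all names are hypothetical.

```python
from collections import deque
from itertools import combinations_with_replacement
from math import sqrt

TYPES = ("D", "A", "P", "N", "L")  # donor, acceptor, positive, negative, lipophilic
TYPE_PAIRS = list(combinations_with_replacement(TYPES, 2))  # 15 unordered pairs
MAX_DIST = 10  # longest topological distance (in bonds) that is counted


def topological_distances(n_atoms, bonds):
    """All-pairs shortest path lengths (in bonds) by breadth-first search."""
    adj = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    dist = []
    for start in range(n_atoms):
        d = [None] * n_atoms
        d[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if d[v] is None:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist.append(d)
    return dist


def correlation_vector(atom_types, bonds):
    """Atom-pair counts per (type pair, distance), scaled by the heavy-atom count."""
    n = len(atom_types)
    vec = [0.0] * (len(TYPE_PAIRS) * MAX_DIST)
    if n == 0:
        return vec
    dist = topological_distances(n, bonds)
    index = {pair: k for k, pair in enumerate(TYPE_PAIRS)}
    for i in range(n):
        for j in range(i + 1, n):
            ti, tj, d = atom_types[i], atom_types[j], dist[i][j]
            if ti is None or tj is None or d is None or d > MAX_DIST:
                continue
            pair = tuple(sorted((ti, tj), key=TYPES.index))
            vec[index[pair] * MAX_DIST + (d - 1)] += 1.0
    return [x / n for x in vec]


def euclidean(u, v):
    """Euclidean distance between two correlation vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical three-atom chain: donor - lipophilic - acceptor
vec = correlation_vector(["D", "L", "A"], bonds=[(0, 1), (1, 2)])
```

Under these assumptions the vector has 15 x 10 = 150 components, and a smaller Euclidean distance between two vectors indicates a greater similarity of the corresponding molecules.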

Hull et al. have developed the Latent Semantic Structure Indexing (LaSSI) approach to perform similarity search in a low-dimensional chemical space. To reduce the dimensionality of the initial chemical space, singular value decomposition is applied to the descriptor-molecule matrix. Molecules are then ranked by similarity to a query molecule in the reduced space using the cosine similarity measure,260 with Carhart's atom pairs154 and Nilakantan's topological torsions95 used as descriptors. The authors claim that this approach "has several advantages over analogous ranking in the original descriptor space: matching latent structures is more robust than matching discrete descriptors, choosing the number of singular values provides a rational way to vary the 'fuzziness' of the search".258
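The idea of ranking in a reduced latent space can be sketched with NumPy. This is one common latent-semantic-indexing variant, assumed here for illustration and not taken from the LaSSI papers: the descriptor-molecule matrix is decomposed by SVD, molecules and query are projected onto the first k left singular vectors, and the cosine similarity is computed there. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def latent_rank(X, query, k):
    """Rank molecules (columns of X) by cosine similarity to `query` in a k-dim latent space.

    X     : array of shape (n_descriptors, n_molecules), e.g. fragment counts
    query : array of shape (n_descriptors,)
    k     : number of singular values (latent dimensions) retained
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]
    mols_k = U_k.T @ X        # latent coordinates of molecules, shape (k, n_molecules)
    query_k = U_k.T @ query   # latent coordinates of the query, shape (k,)
    norms = np.linalg.norm(mols_k, axis=0) * np.linalg.norm(query_k)
    cos = (query_k @ mols_k) / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-cos)  # molecule indices, most similar first
    return order, cos


# Hypothetical toy matrix: 4 descriptors x 4 molecules, two unrelated "families"
X = np.array([[2.0, 2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 3.0],
              [0.0, 0.0, 1.0, 2.0]])
order, cos = latent_rank(X, query=X[:, 0], k=2)
```

Decreasing k makes the search "fuzzier", because molecules that differ only along discarded singular directions become indistinguishable in the latent space.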

The issue of "fuzzification" of similarity search has been addressed by Horvath et al.155-157 The first fuzzy similarity metric suggested155 relies on partial similarity scores calculated with respect to the interatomic distance distributions for each pharmacophore pair. In this case the "fuzziness" enables comparison of pairs of pharmacophores with different topological or 3D distances. Similar results156 were achieved using a fuzzy and weighted modified Dice similarity metric.260 Fuzzy pharmacophore triplets (FPT, see Section …) can be gradually mapped onto related basis triplets, thus minimizing binary classification artifacts.157 In a new similarity scoring index introduced in ref. 157, the simultaneous absence of a pharmacophore triplet in two molecules is taken into account. However, simultaneous absence is a less constraining indicator of similarity than the simultaneous presence of triplets.

Most similarity search approaches require only a single reference structure. However, in practice several lead compounds are often available. This motivated Hert et al.261 to develop the data fusion method, which allows one to screen a database using all available reference structures. The similarity scores obtained for the individual references are then combined for every retrieved structure using selected fusion rules. Searches conducted on the MDL Drug Data Report database using fragment-based UNITY 2D,253 BCI,252 and Daylight251 fingerprints have demonstrated the effectiveness of this approach.
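A minimal sketch of score-based data fusion is given below. It assumes Tanimoto similarity on sets of "on" bits and uses the maximum and the sum of the per-reference scores as two illustrative fusion rules; the function names and toy data are hypothetical and do not reproduce the protocol of ref. 261.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0


def fuse_scores(references, database, rule=max):
    """Score each database molecule against every reference and fuse the scores.

    rule : callable reducing a list of scores to one value, e.g. max or sum
    Returns (name, fused_score) pairs sorted from most to least similar.
    """
    fused = {}
    for name, fp in database.items():
        scores = [tanimoto(ref, fp) for ref in references]
        fused[name] = rule(scores)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two reference actives and three database molecules
references = [{1, 2, 3}, {7, 8, 9}]
database = {"m1": {1, 2, 3}, "m2": {7, 8}, "m3": {4, 5}}
ranking_max = fuse_scores(references, database, rule=max)
ranking_sum = fuse_scores(references, database, rule=sum)
```

With the maximum rule a molecule ranks highly if it resembles any one of the references, whereas the sum rule favours molecules that resemble several references at once.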

The main drawback of conventional similarity search is its inability to use experimental information on biological activity to adjust similarity measures. As a consequence, it cannot discriminate between relevant and non-relevant fragment descriptors used for computing similarity measures. To tackle this problem, Cramer et al.42 developed substructural analysis, in which each fragment (represented as a bit in a fingerprint) is weighted by taking into account its occurrence in active and in inactive compounds. Subsequently, many similar approaches have been described in the literature.262
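The weighting idea can be sketched as follows. The weight used here, the fraction of the molecules containing a fragment that are active, is only one of many possible schemes and is an assumption of this sketch, as is scoring a molecule by the sum of the weights of its fragments; names and data are hypothetical.

```python
from collections import Counter


def fragment_weights(actives, inactives):
    """Weight of each bit = (actives containing it) / (all molecules containing it)."""
    n_act = Counter(bit for fp in actives for bit in set(fp))
    n_inact = Counter(bit for fp in inactives for bit in set(fp))
    return {bit: n_act[bit] / (n_act[bit] + n_inact[bit])
            for bit in set(n_act) | set(n_inact)}


def weighted_score(fp, weights, default=0.0):
    """Score a molecule by summing the weights of the bits it sets.

    Bits never seen in the training set receive `default`.
    """
    return sum(weights.get(bit, default) for bit in set(fp))


# Hypothetical training fingerprints given as sets of on-bits
actives = [{1, 2}, {1, 3}]
inactives = [{2, 4}, {4}]
weights = fragment_weights(actives, inactives)
score = weighted_score({1, 4}, weights)
```

In this toy example bit 1 occurs only in actives (weight 1.0), bit 4 only in inactives (weight 0.0) and bit 2 in both classes (weight 0.5).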

One more way to conduct similarity-based virtual screening is to retrieve the structures containing a user-defined set of "pharmacophoric" features. In the Dynamic Mapping of Consensus positions (DMC) method, features are selected by finding common positions in bit strings for all active compounds. The potency-scaled DMC algorithm (POT-DMC)264 is a modification of DMC in which compound activities are taken into account. The latter two methods may be considered as intermediate between conventional similarity search and probabilistic SAR approaches.
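The selection of consensus positions can be illustrated with a short sketch. It covers only the step described above, finding bit positions on which the actives agree and keeping the database compounds that match them; the adjustable agreement threshold, the function names and the toy bit strings are assumptions of this sketch and not details of the published DMC algorithm.

```python
def consensus_positions(actives, threshold=1.0):
    """Bit positions on which at least `threshold` of the actives share one value.

    actives : list of equal-length sequences of 0/1
    Returns a dict {position: consensus_value}.
    """
    n = len(actives)
    consensus = {}
    for pos in range(len(actives[0])):
        ones = sum(fp[pos] for fp in actives)
        if ones / n >= threshold:
            consensus[pos] = 1
        elif (n - ones) / n >= threshold:
            consensus[pos] = 0
    return consensus


def match_consensus(database, consensus):
    """Names of database compounds that agree with every consensus position."""
    return [name for name, fp in database.items()
            if all(fp[pos] == value for pos, value in consensus.items())]


# Hypothetical four-bit strings
actives = [(1, 0, 1, 1), (1, 0, 0, 1), (1, 0, 1, 1)]
database = {"a": (1, 0, 0, 1), "b": (1, 1, 0, 1), "c": (0, 0, 1, 1)}
strict = consensus_positions(actives)            # all actives must agree
relaxed = consensus_positions(actives, 0.6)      # a 60% majority suffices
selected = match_consensus(database, strict)
```

Lowering the threshold adds consensus positions and therefore makes the filter more restrictive, which reduces the number of retrieved compounds.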

Batista, Godden and Bajorath have developed the MolBlaster method,208 in which molecular similarity is assessed by Differential Shannon Entropy265 computed from populations of randomly generated fragments. For the range 0.64 < T < 0.99, this similarity measure provides the same ranking as the Tanimoto index T. However, for smaller values of T the entropy-based index is more sensitive, since it distinguishes between pairs of molecules having almost identical T. To adapt this methodology for large-scale virtual screening, Proportional Shannon Entropy (PSE) metrics were introduced.209 A key feature of this approach is that class-specific PSE of random fragment distributions enables the identification of molecules that share a significant number of signature substructures with known active compounds.
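An entropy-based comparison of two fragment populations can be sketched as follows. The Shannon entropy of a population is SE = -sum(p_i log2 p_i) over the relative fragment frequencies p_i. The differential form used here, the entropy of the pooled population minus the mean of the two individual entropies, is an assumption of this sketch; the random fragmentation step is omitted and the fragment strings are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(fragments):
    """Shannon entropy (in bits) of the frequency distribution of a fragment population."""
    counts = Counter(fragments)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def differential_shannon_entropy(frags_a, frags_b):
    """Entropy of the pooled population minus the mean entropy of the two populations.

    Assumed form: DSE = SE(A+B) - (SE(A) + SE(B)) / 2.
    Values near zero indicate similar fragment distributions.
    """
    pooled = list(frags_a) + list(frags_b)
    mean_se = (shannon_entropy(frags_a) + shannon_entropy(frags_b)) / 2
    return shannon_entropy(pooled) - mean_se


# Hypothetical fragment populations
same = differential_shannon_entropy(["x", "y", "x", "y"], ["x", "y", "x", "y"])
disjoint = differential_shannon_entropy(["x", "y"], ["z", "w"])
```

Identical populations give a value of zero, while populations with no fragments in common give a clearly positive value, so a smaller value corresponds to a higher similarity.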

Similarity search methods developed for individual compounds are difficult to apply directly to chemical reactions, which involve many species divided into two types: reactants and products. To overcome this problem, Varnek et al.18 suggested condensing all participating reaction species into one molecular graph [Condensed Graphs of Reactions (CGR),18 see Section 1.3.2], followed by its fragmentation and the use of the resulting fingerprints in "classical" similarity search. Besides conventional chemical bonds (single, double, aromatic, etc.), a CGR contains dynamical bonds corresponding to created, broken or transformed bonds. This approach could be efficiently used for screening of large reaction databases.
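The condensation step can be illustrated with a minimal sketch. It assumes that atom-to-atom mapping is already available, so that reactant and product bonds can be given as dictionaries keyed by pairs of mapped atom numbers; each CGR edge is labelled with the bond order before and after the reaction, with 0 meaning "no bond". The data layout and function names are assumptions of this sketch and not the format used in ref. 18.

```python
def _normalise(bonds):
    """Use sorted atom-number pairs as keys so that (1, 2) and (2, 1) coincide."""
    return {tuple(sorted(pair)): order for pair, order in bonds.items()}


def condensed_graph(reactant_bonds, product_bonds):
    """Merge reactant and product bonds into one edge set.

    Each edge carries (order_in_reactants, order_in_products); 0 = bond absent.
    """
    r, p = _normalise(reactant_bonds), _normalise(product_bonds)
    return {pair: (r.get(pair, 0), p.get(pair, 0)) for pair in sorted(set(r) | set(p))}


def dynamical_bonds(cgr):
    """Edges whose order changes: created (0 -> n), broken (n -> 0) or transformed."""
    return {pair: label for pair, label in cgr.items() if label[0] != label[1]}


# Hypothetical mapped substitution: C(1)-Br(2) bond broken, C(1)-O(3) bond formed,
# C(1)-C(4) bond unchanged
reactants = {(1, 2): 1, (1, 4): 1}
products = {(3, 1): 1, (1, 4): 1}
cgr = condensed_graph(reactants, products)
changed = dynamical_bonds(cgr)
```

Fragments enumerated on such a graph include the dynamical-bond labels, so the resulting fingerprints encode the transformation itself and can be compared with the same similarity coefficients as those used for individual molecules.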
