## Kahn dataset

### The dataset

The final section evaluates data fusion when a larger number of original similarity measures is available. The dataset used here is described by Kahn in a discussion of descriptors for the analysis of combinatorial libraries [31]: it contains 75 compounds each belonging to one of 14 well-defined activity classes (angiotensin-converting enzyme inhibitors, acetylcholine receptor inhibitors, antagonists of 2-aminoproprionic acid, aldose reductase inhibitors, angiotensin-II receptor antagonists, beta adrenergic blockers of the type-3 receptor, cyclo oxygenase 2 receptor antagonists, dopamine 3 receptor (ant)agonists, endothelin receptor (ant)agonists, histamine 2 antagonists, neurokinase-1 receptor antagonists, HIV-1 protease inhibitors, non-nucleoside HIV reverse transcriptase inhibitors, and steroid aromatase inhibitors).

Six similarity measures were used to generate rankings: the Molecular Simulations Inc. (MSI) [32] Jurs descriptors; FBSS (as discussed in the previous section); two types of ChemX 3D flexible fingerprints [33]; and two types of Daylight 2D fingerprints [34]. The Jurs descriptors are part of the MSI Cerius2 package, and describe shape and electronic charge by mapping the atomic partial charges onto the solvent accessible areas of the individual atoms within a molecule. All of the 30 possible Jurs descriptors [35] were calculated for each member of the dataset. The values were then normalised, and the similarity between pairs of sets of values calculated using the non-binary Tanimoto coefficient. In what follows, the inclusion of the Jurs rankings in a fusion combination is indicated by 'J'. The FBSS similarity measure has been described previously: its inclusion in a fusion combination is denoted by 'F'. The ChemX 3D flexible fingerprint keys record the presence or absence of potential pharmacophoric patterns (consisting of three pharmacophore centres and the associated inter-atomic distances) in any of the low-energy conformations identified by a rule-based conformational analysis of a molecule. Two sets of similarity scores were generated from these fingerprints: the Tanimoto coefficient scores and the Tversky similarity scores [5,36], the inclusion of these in a fusion combination being denoted by '3' or by 'T', respectively. The Daylight fingerprints were based on unfolded fingerprints considering pathlengths of up to 7, the inclusion of these in a fusion combination being denoted by '2' (for a standard fingerprint where a bit is either set or not set) or by 'N' (for a fingerprint where a count is kept of how many times each bit is set), respectively. Thus 23F, for example, represents the fusion of the standard Daylight, Tanimoto ChemX and FBSS rankings. 
The similarity scores for these experiments were calculated using either the binary or non-binary versions of the Tanimoto coefficient, as appropriate.
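The coefficients named above can be illustrated with a minimal sketch. It uses the standard textbook forms of the binary Tanimoto, non-binary Tanimoto and Tversky coefficients; the function names, the toy fingerprints and the Tversky weights `alpha` and `beta` are assumptions made for illustration, since the weights actually used are not stated here.

```python
from typing import Sequence, Set


def tanimoto_binary(a: Set[int], b: Set[int]) -> float:
    """Binary Tanimoto: bits set in both / bits set in either fingerprint."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def tanimoto_continuous(x: Sequence[float], y: Sequence[float]) -> float:
    """Non-binary Tanimoto for equal-length real-valued or count vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    denom = sum(p * p for p in x) + sum(q * q for q in y) - dot
    return dot / denom if denom else 0.0


def tversky(a: Set[int], b: Set[int], alpha: float = 1.0, beta: float = 0.0) -> float:
    """Tversky similarity; alpha and beta are hypothetical weights.

    With alpha = beta = 1 this reduces to the binary Tanimoto coefficient.
    """
    common = len(a & b)
    denom = alpha * (len(a) - common) + beta * (len(b) - common) + common
    return common / denom if denom else 0.0


# Toy fingerprints, represented as sets of "on" bit positions.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5}
print(tanimoto_binary(fp_a, fp_b))
print(tanimoto_continuous([1, 2, 0], [1, 1, 1]))
print(tversky(fp_a, fp_b, alpha=1.0, beta=0.0))
```

The set-based representation suits unfolded binary fingerprints, while the vector form applies to normalised descriptor values or bit counts.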

### Fusion results

In view of its performance in the studies discussed above, we used just the SUM rule for the fusion experiments, with all possible combinations of rankings from the similarity methods being studied (in much the same way as So and Karplus have very recently evaluated the effectiveness of all possible combinations of seven different QSAR methods [19]). Table 5 details the mean numbers of actives (i.e., molecules with the same activity as the target structure) found in the top-10 nearest neighbours when averaged over all 75 target structures. The values of c at the top of the table denote the number of similarity measures that were fused (so that, e.g., c = 1 represents the original measures and c = 2 represents the fusion of a pair of the original measures) and a bold-font underlined element indicates a fused combination that is better than the best original individual measure (which was the ChemX keys with the Tanimoto coefficient).
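The experimental design can be sketched as follows for a single target structure. This is an illustrative sketch only: it assumes that SUM fusion means summing each molecule's rank positions across the chosen measures (rank 1 being the most similar), and the function names, the tie-breaking rule and the toy rankings are hypothetical. Averaging the counts over all target structures would give values of the kind reported in Table 5.

```python
from itertools import combinations
from typing import Dict, List, Set


def sum_fuse(rankings: List[Dict[str, int]]) -> List[str]:
    """Order molecules by the sum of their rank positions (lower is better)."""
    molecules = rankings[0].keys()
    fused = {m: sum(r[m] for r in rankings) for m in molecules}
    # Ties are broken by molecule id purely to make the sketch deterministic.
    return sorted(molecules, key=lambda m: (fused[m], m))


def actives_in_top_k(order: List[str], actives: Set[str], k: int = 10) -> int:
    """Count the actives among the k top-ranked molecules."""
    return sum(1 for m in order[:k] if m in actives)


def evaluate_all_combinations(
    measures: Dict[str, Dict[str, int]], actives: Set[str], k: int = 10
) -> Dict[str, int]:
    """Score every non-empty combination of measures, keyed by a label like '23F'."""
    labels = sorted(measures)
    results = {}
    for c in range(1, len(labels) + 1):
        for combo in combinations(labels, c):
            order = sum_fuse([measures[label] for label in combo])
            results["".join(combo)] = actives_in_top_k(order, actives, k)
    return results


# Toy rankings of five molecules under three measures (hypothetical data).
toy_measures = {
    "2": {"m1": 1, "m2": 2, "m3": 3, "m4": 4, "m5": 5},
    "3": {"m1": 3, "m2": 1, "m3": 2, "m4": 5, "m5": 4},
    "F": {"m1": 2, "m2": 4, "m3": 1, "m4": 3, "m5": 5},
}
toy_actives = {"m2", "m3"}
toy_results = evaluate_all_combinations(toy_measures, toy_actives, k=2)
for label, hits in toy_results.items():
    print(label, hits)
```

With six measures the same loop generates 63 combinations (6, 15, 20, 15, 6 and 1 for c = 1 to 6).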

It will be seen that very many of the fused combinations in Table 5 are bold underlined, thus providing further support for the use of SUM to fuse similarity rankings, and Ginn reports similar results from other analyses of this dataset [23]. The table also shows that the fraction of the combinations that are bold underlined increases in line with c, so that all combinations with c > 4 perform at least as well as the best of the individual similarity measures. However, it is not the case that, e.g., the c = 5 combinations are invariably superior to the c = 4 combinations, and the best result overall was obtained with 23FJT (rather than with 23FJNT, the combination involving all of the individual measures). Thus, while simply fusing as many individual measures as are available in a similarity investigation would appear to perform well, superior results may be obtained from fusing a subset of the individual measures; this has also been noted in searches of text databases [10] but there is no obvious predictive mechanism for identifying an optimal combination a priori [23,37].

Table 5. Mean number of actives found in the 10 nearest neighbours when combining various numbers, c, of different similarity measures for searches of the Kahn dataset. Bold underlined entries indicate a fused result at least as good as the best original similarity measure

| c = 1 | Mean | c = 2 | Mean | c = 3 | Mean | c = 4 | Mean | c = 5 | Mean | c = 6 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.80 | 23 | 1.10 | 2FN | 1.08 | 23FJ | 1.52 | 23FJN | 1.45 | 23FJNT | 1.43 |
| 3 | 1.12 | 2F | 1.04 | 2FT | 1.28 | 23FN | 1.23 | 23FJT | 1.69 | | |
| F | 0.89 | 2J | 1.01 | 2JN | 1.03 | 23FT | 1.43 | 23FNT | 1.36 | | |
| J | 1.08 | 2N | 0.68 | 2JT | 1.10 | 23JN | 1.31 | 23JNT | 1.43 | | |
| N | 0.63 | 2T | 0.95 | 2NT | 0.95 | 23JT | 1.45 | 2FJNT | 1.43 | | |
| T | 0.69 | 3F | 1.09 | 3FJ | 1.40 | 23NT | 1.25 | 3FJNT | 1.51 | | |
| | | 3J | 1.25 | 3FN | 1.19 | 2FJN | 1.28 | | | | |
| | | 3N | 1.00 | 3FT | 1.33 | 2FJT | 1.53 | | | | |
| | | 3T | 1.32 | 3JN | 1.25 | 2FNT | 1.28 | | | | |
| | | FJ | 1.20 | 3JT | 1.45 | 2JNT | 1.17 | | | | |
| | | FN | 0.91 | 3NT | 1.20 | 3FJN | 1.35 | | | | |
| | | FT | 1.11 | FJN | 1.11 | 3FJT | 1.55 | | | | |
| | | JN | 0.89 | FJT | 1.21 | 3FNT | 1.41 | | | | |
| | | JT | 0.93 | FNT | 1.11 | 3JNT | 1.36 | | | | |
| | | NT | 0.85 | JNT | 1.12 | FJNT | 1.32 | | | | |
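The comparison of fused combinations against the best individual measure can be sketched as a short calculation. The values below are transcribed from the c = 1, c = 2, c = 4 and c = 6 entries of Table 5; the function name is hypothetical, and the threshold follows the table caption ("at least as good as the best original similarity measure"), which is an assumption about how the highlighted entries were chosen.

```python
from collections import defaultdict
from typing import Dict

# Mean actives in the top 10, transcribed from Table 5 (c = 1, 2, 4 and 6 only).
mean_actives: Dict[str, float] = {
    "2": 0.80, "3": 1.12, "F": 0.89, "J": 1.08, "N": 0.63, "T": 0.69,
    "23": 1.10, "2F": 1.04, "2J": 1.01, "2N": 0.68, "2T": 0.95,
    "3F": 1.09, "3J": 1.25, "3N": 1.00, "3T": 1.32, "FJ": 1.20,
    "FN": 0.91, "FT": 1.11, "JN": 0.89, "JT": 0.93, "NT": 0.85,
    "23FJ": 1.52, "23FN": 1.23, "23FT": 1.43, "23JN": 1.31, "23JT": 1.45,
    "23NT": 1.25, "2FJN": 1.28, "2FJT": 1.53, "2FNT": 1.28, "2JNT": 1.17,
    "3FJN": 1.35, "3FJT": 1.55, "3FNT": 1.41, "3JNT": 1.36, "FJNT": 1.32,
    "23FJNT": 1.43,
}


def fraction_matching_best_single(scores: Dict[str, float]) -> Dict[int, float]:
    """For each c > 1, the fraction of combinations >= the best single measure."""
    best_single = max(v for label, v in scores.items() if len(label) == 1)
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for label, value in scores.items():
        c = len(label)  # one character per fused measure
        if c == 1:
            continue
        totals[c] += 1
        if value >= best_single:
            hits[c] += 1
    return {c: hits[c] / totals[c] for c in sorted(totals)}


fractions = fraction_matching_best_single(mean_actives)
for c, fraction in fractions.items():
    print(f"c = {c}: {fraction:.2f}")
```

Because each measure is denoted by a single character, the label length gives c directly, so the same function could be applied to any subset of the table.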
