Data fusion

Background

Data fusion is 'a process of combining inputs from sensors with information from other sensors, information processing blocks, databases, or knowledge bases, into one representational format' [8]. Defence applications have provided much of the driving force for the development of data fusion techniques, with published examples including establishing the friend-or-foe nature of an incoming missile or aeroplane, predicting the range and direction of a battlefield target, and navigating an unmanned armoured vehicle. Other applications include surveillance operations by law enforcement agencies, real-time control of continuous manufacturing processes, the provision of all-weather visibility for aircraft pilots, and multi-imaging systems for the analysis of medical images (see, e.g., [9]). However, data fusion can be, and is, used in much more commonplace situations: for example, establishing that it is safe to cross a road involves taking input from one's ocular sensors (eyes) and aural sensors (ears), and then combining this information with the knowledge that an empty road is a safe road to give an output denoting the safety of the proposed action. Again, a committee in which all members can contribute will often arrive at a superior decision to the one that would have been reached by just the committee chair, although there are, of course, many exceptions to such a rule!

The basic rationale for data fusion is that using the information presented by a number of sensors enables further information to be inferred that would be outside the capabilities of a single sensor. For example, if one sensor detects a tank, then all that can be deduced is the existence and the position of that tank. However, if two sensors detect the same tank then inferences can be made regarding the direction of its movement, while the addition of a temporal dimension permits the tank's velocity to be calculated. Add to that the ability to compare the observed behaviour with records of past behaviour of tanks and the system becomes capable of threat analysis. As well as allowing more information to be inferred, the use of a fusion system leads to both qualitative and quantitative improvements in several ways. First, operational performance is more robust: if one of the sensors is damaged, information is still coming in from the others (an obvious advantage in military applications, where sensors are exposed to combat conditions and are thus liable to be damaged). Second, data fusion extends coverage, since multiple sensors can cover disparate areas, times and qualities. Third, it increases the level of confidence in the results, since multiple sensors can act together to confirm an event and to reduce any ambiguity surrounding, e.g., the classification of an event.

Combination of rankings

Our interest in data fusion methods arose from recent work on their application to information retrieval (IR), specifically to the combination of the rankings produced by different retrieval mechanisms when applied to databases of textual documents. An early study is that by Belkin et al. [10], in which data fusion was used to combine the results of a series of searches of bibliographic databases, conducted in response to a single query, but employing different indexing and searching strategies. A query was processed using different strategies, each of which was used to produce a ranking of a set of documents in order of decreasing similarity with the query. The ranks for each of the documents were then combined using one of several different fusion rules (including the MIN, MAX and SUM rules discussed below); the output of the fusion rule was taken as the document's new similarity score and the fused lists were then re-ranked in descending order of similarity. This work soon led to many other studies (see, e.g. [11-13]) and the combination of document rankings is now a well-established technique, as is exemplified by its use in a meta-search engine that provides access to the World Wide Web using a combination of different search engines [14].

The work on chemical data fusion reported here is based directly on these previous IR studies, and involves the simple procedure shown in Scheme 1, where a user-defined target structure is searched against a database using several different similarity measures. The fusion rules that we use here are based on those identified by Belkin et al. [10], and are summarised in Table 1. It will be seen that the MIN and MAX rules represent the assignment of extreme ranks to database structures and it is thus hardly surprising that both can be highly sensitive to the presence of a single 'poor' retrieval system amongst those that are being combined. The SUM rule is expected to be more stable against the presence of a single poor or noisy input ranking; here, each database structure is assigned the sum of all the rank positions at which it occurs in the input lists. This report considers just these three rules, but there are clearly many others that could be considered, e.g., the median, the product or the harmonic mean of the individual rankings.

1. Execute a similarity search of a chemical database for some particular target structure using two, or more, different measures of inter-molecular structural similarity.

2. Note the rank position, r_i, of each database structure in the ranking resulting from use of the i-th similarity measure.

3. Combine the various rankings using one of the fusion rules (MIN, MAX or SUM), giving a new combined score for each database structure.

4. Rank the resulting combined scores, and then use this ranking to calculate a quantitative measure of the effectiveness of the search for the chosen target structure.

Scheme 1. Combination of similarity rankings using data fusion.

Table 1. Fusion rules for combining n ranked lists, where r_i denotes the rank position of a specific database structure in the i-th (1 ≤ i ≤ n) ranked list.

Name    Fusion rule
MIN     $\min(r_1, r_2, \ldots, r_i, \ldots, r_n)$
MAX     $\max(r_1, r_2, \ldots, r_i, \ldots, r_n)$
SUM     $\sum_{i=1}^{n} r_i$

The combined scores output by the fusion rule are then used to re-order the database structures to give the final ranked output. In many cases, especially with the SUM rule, the application of the fusion rule may result in the assignment of the same score to two or more items. When this happens, a further sort key must be specified to resolve the tied structures. Examples include alphabetical ordering of the canonicalised connection tables describing the tied database structures, or the allocation of weights to individual rankings (perhaps based on past performance in similarity searches), so that a high position in one ranking would differ in importance from that same position in another ranking.
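The fusion and re-ordering steps can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the single-letter structure identifiers and the three example rankings are all hypothetical. It assumes that every input ranking contains the same structures, that rank 1 denotes the most similar structure (so that a lower fused score sorts first), and that ties are resolved by alphabetical order of the identifier, standing in for the ordering of canonicalised connection tables.

```python
from typing import Callable, Dict, List, Sequence

# The three fusion rules of Table 1; each maps the rank positions
# r_1..r_n of one structure to a single combined score.
FUSION_RULES: Dict[str, Callable[[Sequence[int]], int]] = {
    "MIN": min,
    "MAX": max,
    "SUM": sum,
}


def rank_positions(ranking: Sequence[str]) -> Dict[str, int]:
    """Map each structure identifier to its 1-based rank position."""
    return {s: pos for pos, s in enumerate(ranking, start=1)}


def fuse_rankings(rankings: Sequence[Sequence[str]], rule: str = "SUM") -> List[str]:
    """Fuse several rankings of the same structures into one ranked list.

    Assumption: rank 1 is the best position, so structures are ordered by
    ascending fused score. Ties are broken alphabetically by identifier
    (an illustrative secondary sort key).
    """
    combine = FUSION_RULES[rule]
    positions = [rank_positions(r) for r in rankings]
    structures = set(positions[0])
    if any(set(p) != structures for p in positions):
        raise ValueError("all rankings must cover the same structures")
    fused = {s: combine([p[s] for p in positions]) for s in structures}
    return sorted(structures, key=lambda s: (fused[s], s))


# Hypothetical rankings of four structures from three similarity measures.
ranking_1 = ["A", "B", "C", "D"]
ranking_2 = ["B", "A", "D", "C"]
ranking_3 = ["D", "A", "B", "C"]
all_rankings = [ranking_1, ranking_2, ranking_3]

print(fuse_rankings(all_rankings, "SUM"))  # ['A', 'B', 'D', 'C']
print(fuse_rankings(all_rankings, "MIN"))  # ['A', 'B', 'D', 'C'] (A, B, D tie on 1)
print(fuse_rankings(all_rankings, "MAX"))  # ['A', 'B', 'C', 'D'] (C, D tie on 4)
```

In this example the MIN rule assigns the same score to three of the four structures, so their final order is decided entirely by the secondary sort key. A weighted variant would multiply each rank position by a per-ranking weight before combining.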

Chemical applications

Chemical applications of data fusion are not completely novel. As long ago as 1973, Clerc and Erni noted that 'when data from several different spectroscopic methods are used for comparison purposes, greatly enhanced performance may be expected because the methods complement each other' [15] and went on to discuss the use of a scoring scheme based on weighted contributions from each of several molecular properties and spectra. More recently, Masui and Yoshida [16] have reported the use of the SPECTRA system for combining the similarity scores obtained in searches of a database containing mass, IR, and 1H and 13C NMR spectral data when one or more of the spectra are missing for a particular sample molecule. In work more analogous to that reported here, Kearsley et al. have used both similarity-based and rank-based procedures to combine pairs of similarity searches of the Standard Drug File database, and found that significant improvements in performance could be achieved in simulated property prediction experiments [17, 18]. Finally, So and Karplus have recently advocated combining different QSAR methods to obtain models with heightened predictivity [19].

Our initial studies of data fusion were undertaken as part of a project to evaluate the EVA descriptor, which characterises a molecule by its fundamental vibrational fingerprint [20]. Although originally developed for QSAR applications, the EVA descriptor can also be used for similarity searching and a range of EVA-based similarity measures were hence evaluated using a dataset containing 8178 molecules from the Starlist file [21]. Comparable searches were also carried out using the 2D similarity searching routines in the UNITY chemical information management system [22], and using data fusion to combine the two individual types of ranking. Simulated leave-one-out property prediction experiments using the logP data in the Starlist file showed that, on average, the fused rankings appeared to be better than the original 2D and EVA rankings. Although the differences were not always statistically significant, the study provided at least some evidence that data fusion could be used to improve the performance of similarity searching in chemical databases: the remainder of this article reports further experiments that have been undertaken to ascertain the accuracy of this conclusion. Full details of the work are provided by Ginn [23].
