Using Gene Expression

There are three components to the empirical approach of developing a predictive classifier. The first component is determining which genes to include in the predictor. This is generally called "feature selection." Including too many "noise variables" in the predictor usually reduces the accuracy of prediction. The second component is specification of the mathematical function that will provide a prediction for any given expression vector.

The third component is parameter estimation. Most predictors have parameters that must be assigned values before the predictor is fully specified. For many kinds of predictors there is also a cut-point that must be specified for translating a quantitative predictive index into a prediction (e.g., 0 or 1) for binary class prediction problems.

Feature selection is usually based on identifying the genes that are differentially expressed among the classes when considered individually. For example, if there are two classes, one can compute a t-test or a modified t-test in which a hierarchical variance model is used for increasing the degrees of freedom for estimation of the gene-specific within-class variances (2). The logarithms of the expression measurements are used as the basis of the statistical significance tests. The genes that are differentially expressed at a specified significance level are selected for inclusion in the class predictor. The stringency of the significance level controls the number of genes that are included in the model. Although many computationally complex methods have been published to identify optimal sets of genes that together provide good discrimination, little compelling evidence currently exists that the computational effort of these methods is warranted.
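The gene-by-gene selection step described above can be sketched in a few lines. This is a minimal illustration, not the method of reference (2): it uses an ordinary pooled-variance two-sample t-statistic without the hierarchical variance model, and it selects genes by a cutoff on |t| rather than by a computed p-value. The function name select_genes_by_t, the toy data, and the cutoff 2.776 (the two-sided 5% critical value of a t distribution with 4 degrees of freedom) are assumptions made for the example.

```python
import numpy as np

def select_genes_by_t(log_expr, labels, t_cutoff):
    """Return (indices of selected genes, per-gene t-statistics).

    log_expr: (n_samples, n_genes) array of log expression values.
    labels:   length n_samples array of 0/1 class labels.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    labels = np.asarray(labels)
    a = log_expr[labels == 0]
    b = log_expr[labels == 1]
    n_a, n_b = len(a), len(b)
    # Pooled within-class variance for each gene (ordinary t-test).
    pooled_var = ((n_a - 1) * a.var(axis=0, ddof=1)
                  + (n_b - 1) * b.var(axis=0, ddof=1)) / (n_a + n_b - 2)
    std_err = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / std_err
    # A stricter cutoff selects fewer genes.
    return np.flatnonzero(np.abs(t) > t_cutoff), t

# Toy data: 6 samples x 3 genes; only gene 0 differs between the classes.
log_expr = np.array([
    [1.0, 2.0, 5.0],
    [1.1, 2.1, 5.2],
    [0.9, 1.9, 4.8],
    [3.0, 2.0, 5.1],
    [3.1, 1.9, 4.9],
    [2.9, 2.1, 5.0],
])
labels = np.array([0, 0, 0, 1, 1, 1])
selected, t_stats = select_genes_by_t(log_expr, labels, t_cutoff=2.776)
print(selected)
```

Raising the cutoff corresponds to a more stringent significance level and so to a smaller feature set.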

Many algorithms have been used effectively with DNA microarray data for predicting a binary outcome, e.g., response versus non-response. Dudoit et al. (3) compared several algorithms using several publicly available data sets. A linear discriminant is a function

l(x) = \sum_{i \in F} w_i x_i    (1)

where x_i denotes the logarithm of the expression measurement for the ith gene, w_i is the weight given to that gene, and the summation is over the set F of features (genes) selected for inclusion in the class predictor. For a two-class problem, there is a threshold value d, and a sample with expression profile defined by a vector x of values is predicted to be in class 1 or class 2 depending on whether l(x) as computed from equation (1) is less than the threshold d or greater than d, respectively.
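As a small illustration of the linear discriminant just described (a weighted sum of log expression values over the selected genes, compared with a threshold d), the following sketch may help. The function names, the weights, and the sample values are hypothetical, and assigning a tie at exactly d to class 2 is an arbitrary choice made here.

```python
import numpy as np

def linear_discriminant(x, weights, features):
    """l(x): weighted sum of log expression over the selected genes F."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(weights, x[features]))

def predict_class(x, weights, features, d):
    """Class 1 if l(x) < d, otherwise class 2 (ties go to class 2 here)."""
    return 1 if linear_discriminant(x, weights, features) < d else 2

features = np.array([0, 2])      # indices of the genes in F
weights = np.array([1.0, -0.5])  # one weight per selected gene
x = np.array([2.0, 9.0, 4.0])    # log expression profile; gene 1 is not in F

score = linear_discriminant(x, weights, features)  # 1.0*2.0 + (-0.5)*4.0
print(score, predict_class(x, weights, features, d=1.0))
```

The classifiers discussed in the text differ only in how the weights and the threshold are obtained from training data.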

Many types of classifiers are based on linear discriminants of the form shown in (1). They differ with regard to how the weights are determined. The oldest form of linear discriminant is Fisher's linear discriminant. To compute the weights for the Fisher linear discriminant, one must estimate the correlation between all pairs of genes that were selected in the feature selection step. The study by Dudoit et al. indicated that Fisher's linear discriminant did not perform well unless the number of selected genes was small relative to the number of samples. The reason is that otherwise there are too many correlations to estimate from the available samples, so the method tends to be unstable and to over-fit the data.

Diagonal linear discriminant analysis is a special case of Fisher linear discriminant analysis in which the correlation among genes is ignored. By ignoring such correlations, one avoids having to estimate many parameters, and obtains a method that performs better when the number of samples is small. Golub's weighted voting method (4) and the Compound Covariate Predictor of Radmacher et al. (5) are similar to diagonal linear discriminant analysis and tend to perform very well when the number of samples is small. They compute the weights based on the univariate prediction strength of individual genes and ignore correlations among the genes.
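A minimal sketch of diagonal linear discriminant analysis for two classes follows. It assumes equal class priors, a pooled per-gene variance, and input already restricted to the selected genes; the weight for each gene is the difference in class means divided by that gene's pooled variance, and the threshold is the discriminant evaluated at the midpoint of the two class means. The function names and the toy data are hypothetical.

```python
import numpy as np

def fit_dlda(log_expr, labels):
    """Estimate per-gene weights w and threshold d from training data.

    log_expr: (n_samples, n_genes) log expression for the selected genes.
    labels:   length n_samples array with values 1 or 2.
    """
    X = np.asarray(log_expr, dtype=float)
    y = np.asarray(labels)
    x1, x2 = X[y == 1], X[y == 2]
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # Only per-gene variances are estimated; gene-gene correlations are ignored.
    pooled_var = ((n1 - 1) * x1.var(axis=0, ddof=1)
                  + (n2 - 1) * x2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    w = (m2 - m1) / pooled_var
    d = float(np.dot(w, (m1 + m2) / 2.0))
    return w, d

def predict_dlda(x, w, d):
    """Class 1 if the weighted sum is below d, otherwise class 2."""
    return 1 if float(np.dot(w, np.asarray(x, dtype=float))) < d else 2

train = np.array([
    [1.0, 5.0], [1.2, 5.1], [0.8, 4.9],   # class 1
    [3.0, 5.0], [3.2, 4.9], [2.8, 5.1],   # class 2
])
train_labels = np.array([1, 1, 1, 2, 2, 2])
w, d = fit_dlda(train, train_labels)
print(predict_dlda([1.1, 5.0], w, d), predict_dlda([2.9, 5.0], w, d))
```

With p selected genes, this requires p variances rather than the p(p-1)/2 pairwise correlations needed for the Fisher discriminant, which is why it remains stable with few samples.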

Support vector machines are very popular in the machine learning literature. Although they sound very exotic, linear kernel support vector machines do class prediction using a predictor of the form of equation (1). The weights are determined by optimizing a misclassification rate criterion, however, instead of a least-squares criterion as in linear discriminant analysis (6). Although there are more complex forms of support vector machines, they appear to be inferior to linear kernel SVMs for class prediction with large numbers of genes (7).

In the study of Dudoit et al. (3), the simplest methods, diagonal linear discriminant analysis and nearest neighbor classification, performed as well as or better than the more complex methods. Nearest neighbor classification is defined as follows. It depends on a feature set F of genes selected to be useful for discriminating the classes. It also depends upon a distance function d(x, y), which measures the distance between the expression profiles x and y of two samples. The distance function utilizes only the genes in the selected set of features F. To classify a sample with expression profile y, compute d(x, y) for each sample x in the training set. The predicted class of y is the class of the sample in the training set that is closest to y with regard to the distance function d. A variant of nearest neighbor classification is k-nearest neighbor classification. For example, with 3-nearest neighbor classification, you find the three samples in the training set that are closest to the sample y. The class that is most represented among these three samples is the predicted class for y. Tibshirani et al. (8) developed a variant called shrunken centroid classification that combines the gene selection and nearest centroid classification components.
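The nearest neighbor rule described above can be sketched as follows. The text does not fix a particular distance function, so Euclidean distance over the selected genes is assumed here; the function name knn_predict, the class labels, and the toy data are likewise hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(y, train_expr, train_labels, features, k=3):
    """Predict the class of profile y by majority vote of its k nearest training samples."""
    train = np.asarray(train_expr, dtype=float)[:, features]
    query = np.asarray(y, dtype=float)[features]
    # Euclidean distance restricted to the genes in F (an assumed choice of d).
    dists = np.sqrt(((train - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_expr = np.array([
    [1.0, 1.0, 9.0],
    [1.2, 0.9, 0.0],
    [0.9, 1.1, 5.0],
    [3.0, 3.0, 1.0],
    [3.1, 2.9, 1.0],
])
train_labels = ["responder", "responder", "responder",
                "non-responder", "non-responder"]
features = [0, 1]  # gene 2 is not in F, so it does not affect the distance

print(knn_predict([2.8, 3.0, 9.0], train_expr, train_labels, features, k=3))
```

Setting k=1 gives ordinary nearest neighbor classification. An odd k avoids tied votes in a two-class problem.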

Dudoit et al. also studied some more complex methods such as classification trees and aggregated classification trees. These methods did not appear to perform any better than diagonal linear discriminant analysis or nearest neighbor classification. Ben-Dor et al. (7) also compared several methods on several public datasets and found that nearest neighbor classification generally performed as well as or better than more complex methods.
