The cut point is the z where P[S2lz] = rh . which is the same as above as long as the logistic model has this form. Usually, the first log term is estimated from the data, which will give a result that is not optimal for the real prior probabilities.

The receiver operating characterization (ROC) curve is constructed by sweeping the black line from right to left and calculating a and p at each point. The results are shown in Figure 4 for all four cases. The positively sloped diagonal line represents the case when the distributions are exactly the same. The letters for the cases in Figure 3 are placed at the optimal points, respectively. For equally likely priors, they fall on the negatively sloped diagonal line. A statistical test can be performed to see if the area under the ROC curve

Figure 4 ROC curves for cases A to D.

Figure 4 ROC curves for cases A to D.

is significantly greater than xh. Unfortunately, this only indicates that the biomarker can detect one of two signals under ideal conditions. A better test might be one that can detect if the length of the vector (V P2 (z) + [1 - P2 (z)]2)

from the lower right corner to the optimal point extends significantly beyond the positive diagonal line. It should be noted that this length measure depends both on the data observed and on the priors. If one of the two signals is rare, it would be very hard to show significance. This is not true for the general test of ROC area.

With respect to evaluating a biomarker, Figures 3 and 4 demonstrate some interesting aspects. Case A represents a situation where a single test discriminates well between S\ and S2 but it is not noise-free. It would be difficult to do better than this in the real world. What is striking is that case B, which shows essentially no visible separation, actually has an ROC area greater than xh that could be detected with sufficiently large n. Typical ROC curves look more like case C, perhaps slightly better. The densities in this case show only modest separation and would not make a very impressive biomarker. However, the ROC curve area test might tend to get people excited about its prospects. If one of the signals is rare, it becomes essentially undetectable, but the ROC area test gives no indication of this. Case C is a good candidate for adding another test to make it a multivariate biomarker. In higher dimensions, greater separation may occur (i.e., more information might be transmitted).

However, the ROC curve cannot handle more than two outcomes, at least not in a visually tractable way, although collapsing cells by summing in Table 1 to the 2 x 2 case might work in some cases. This takes us back to I(S,D) as a measure of the biomarker's utility since it works for any number of signals and any number of tests. A generalization of the discriminant function would minimize the total noise, obtained by summing the off-diagonal elements of Table 1:

This minimization would probably be difficult using calculus as before and would require a numerical analysis approach, too involved to describe here. This minimization would determine the R' s. The partitions, the edges of the R's, are a point, a line, a plane, and flat surfaces of higher dimension as the number of tests increases from one, respectively, assuming that the parameters of the P' s differ only in the means and that the P's are Gaussian. When the other parameters differ among the P's, the surfaces are curved.

Without some involved mathematics, it is difficult to know if the minimal noise optimization is equivalent to the maximal information optimization. It seems entirely feasible that communication systems with the same total noise could have different diagonal elements, one of which might give the most information about all the signals in the system. The answer is unknown to the author and is probably an open research question for biological systems. Another level of complication, but one that has more real-world relevance, is one that minimizes the cost. This is discussed at an elementary level elsewhere [24] . Multiplying each term in Table 1 by its cost, and minimizing the total expected cost, is a classical decision theory approach [25-27].

The previous discussion of diagnostic performance has said little about time effects. The time variable is contained in P and will add another dimension for each time point measured, in general. This presents a severe dimensionality problem, both for the biologist and the analyst, since each measurement of the biomarker on the same person creates a new dimension. If the measurement times are not the same for all cases, the entire optimization process may depend on which times are chosen. The biggest problem with dimensionality is that it usually involves a growing number of parameters. To obtain precise (high-Fisher-information) estimates of all the parameters simultaneously, the sample size requirement grows much faster than the dimension. Here is a place where invariance plays a key role. If some parameters can be shown to be biological constants analogous to physical constants, through data pooling they can be estimated once and reused going forward. If the parameters vary with time, the time function for the parameter needs to be determined and reused similarly. Functions and parameters are very compact objects for storing such reusable knowledge.

Project Management Made Easy

Project Management Made Easy

What you need to know about… Project Management Made Easy! Project management consists of more than just a large building project and can encompass small projects as well. No matter what the size of your project, you need to have some sort of project management. How you manage your project has everything to do with its outcome.

Get My Free Ebook

Post a comment