Is That Significance Significant? And, Conversely, the Significant Insignificance

Statistics texts all deal with the issue of type I and type II errors, or, respectively, the false finding of a significant difference where none exists and the false finding of no significant difference where one indeed exists. They seem invariably to omit mention, however, of how a statistically significant difference that truly exists and is truly shown can yet remain truly irrelevant. This particular possibility gets no press time in statistics manuals because it remains in the realm of that softer science mentioned above, judgment.

As trained scientists sensitized to the dangers of making inferences based on anecdotal or potentially chance findings, we seem as a group to have acquired the tendency to overexalt the statistically significant finding when it makes its appearance. A p-value of 0.05 or less has become our holy grail, the golden chalice that shines from the paragraphs of our grant and investigational new drug (IND) applications, presaging, we hope, the sweet wine to spill forth from them. This, of course, is not unnatural when one considers how often we are faced with budgetary or time constraints on the number of subjects that can be tested and therefore on the ease of statistically confirming such differences when they truly exist. The converse situation of which we need to be mindful, however, is that with enough samples being evaluated, even tiny differences can be statistically significant, and furthermore, even differences that are not tiny might still not be relevant. What constitutes a "significant" increment of an analyte, in the physiological or clinical sense, is likely to vary with the situation; it is not necessarily on the same scale as a statistically significant increment. An additional point to consider, addressed below in the section on measurement error, is whether the magnitude of a statistically significant difference surpasses the magnitude of inaccuracy of measurement.
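The sample-size side of this point is easy to make concrete. The sketch below uses hypothetical analyte numbers and a simple two-sample z-test (not any particular study's method) to show the same tiny mean difference flipping from "nonsignificant" to "significant" purely as a function of how many subjects are tested:

```python
import math

def two_sample_z_p(mean_diff, sd, n):
    """Two-sided p-value for a two-sample z-test with equal
    group sizes n and a common standard deviation sd."""
    se = sd * math.sqrt(2.0 / n)
    z = mean_diff / se
    # two-sided p-value from the standard normal CDF
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# A trivial shift in an analyte: mean difference of 1 unit on sd 50.
print(two_sample_z_p(1.0, 50.0, 100))     # far from significant at n = 100
print(two_sample_z_p(1.0, 50.0, 50000))   # "significant" at n = 50,000
```

Whether a 2% shift on this scale means anything physiologically is exactly the judgment call that the p-value cannot make for us.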

A qualitative judgment on the significance of significance, rather than simply adjusting our blinders to keep out the sun while logging p-values, is certainly worth a moment of silence on our part. The converse case is also relevant: what if one's inner Hercule suspects that clinical significance may be lurking behind the harmless façade of p > 0.05? How could we reconcile the lack of statistical significance in that case? One possibility is that too few samples were evaluated to achieve the statistical power needed to reveal significance (discussed further later in the chapter). Another is that part of the group of samples tested reacts differently from another part or parts, so that the total effect across the disparate subgroups does not reach statistical significance, even though it reflects a clinical situation of great significance.
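The underpowered-study possibility can be sketched with an approximate power calculation. The numbers below are illustrative only, and the formula is the usual two-sample z-test approximation, not a prescription for any particular design:

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sample(mean_diff, sd, n):
    """Approximate power of a two-sided, two-sample z-test at
    alpha = 0.05 (equal group sizes n, common sd); the far tail
    of the rejection region is ignored, as is conventional."""
    z_alpha = 1.96  # two-sided 5% critical value
    se = sd * math.sqrt(2.0 / n)
    return phi(abs(mean_diff) / se - z_alpha)

# A clinically real effect: mean difference of 10 units on sd 25.
print(power_two_sample(10.0, 25.0, 10))    # ~14%: a small study will usually miss it
print(power_two_sample(10.0, 25.0, 100))   # ~81%: a larger study will usually find it
```

With ten subjects per arm, a genuine effect of this size would be declared "nonsignificant" about six times out of seven.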

The latter concept is central to the development of personalized medicine, an emerging field that seeks to improve disease treatment by targeting molecularly distinct subpopulations within a given group, such that response rates to treatments are increased and the administration of ineffective and potentially toxic treatments to inappropriate individuals is diminished. As an example, selection of breast cancer patients according to ErbB2 tyrosine kinase status has had a significant impact on management of the disease; moderate, but significant, response rates to ErbB2-targeted antibody therapeutics in some studies might have been overlooked had study groups not first been selected for ErbB2 overexpression [6]. Some basic concepts concerning statistical methods of relevance to pharmacogenomics are discussed later, in addition to some of the common errors in their application.

Air the Error

Because statistical descriptions and inferences are based on probability, a certain amount of chance (random) error is always inherent in any statistically based conclusion. This error is routinely quantified by upper and lower confidence limits and mentioned in results sections. Unfortunately, such routine error expression seems to induce a false sense of security in readers that all the error inherent in the results has been taken into account.

In fact, there are numerous sources of nonrandom error, generally covered by the blanket term bias (i.e., when subjects, specimens, or data in the groups being compared are inherently different or are handled differently in a way that systematically introduces a signal into the data of one group). Differences such as the types of tubes used, the length of time until spinning, or the time samples are kept frozen prior to analysis fall into this category. Bias has been deemed such a serious problem in nonexperimental research that some experts consider such studies guilty of bias and erroneous results until proven innocent [7]. The term nonexperimental research refers to research in which the effects of some perturbation to the system are not being tested, but rather, information is simply gathered (e.g., in epidemiological studies) or quantified (e.g., in laboratory analyses, such as most proteomic analysis). Bias is also a serious and often overlooked problem in experimental research (i.e., where effects of deliberate manipulation of one or more variables are tested). A recent perspective article on how potential flaws in preclinical research may play a major role in late-phase drug failures discusses this in detail, particularly as it relates to animal models [8].

Measurement imprecision, included by some under the term bias, but by others considered separately, constitutes an additional significant source of error. In the case of continuous data quantifications, although measurement error almost always occurs, it is almost never quantified, or at least such quantification is in no way accounted for in the result reported. In the case of noncontinuous data, such as in diagnostic tests that may generate a yes/no dichotomy, measurement error can have a different but equally profound effect, potentially contributing to false positives or negatives.

Continuous Data Quantifications

Every assay result is only the best approximation to the true value that a particular assay is able to produce. This is readily acknowledged by all, and, in fact, it is standard practice in drug development to quantify the inaccuracy of assays using known amounts of a reference standard. How the quantification should best be accomplished is beyond the scope of this chapter, but the total error approach wisely incorporates both the components of random error and bias/measurement error in a single measure and is discussed further elsewhere [9-11].
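As a rough illustration only (the exact formulation varies among the approaches cited in [9-11]), one simple total-error combination adds the absolute bias to a multiple of the imprecision:

```python
def total_error(bias_pct, cv_pct, k=2.0):
    """One simple total-error formulation: absolute bias plus k times
    the coefficient of variation; k = 2 covers roughly 95% of results
    under a normal-error assumption. Illustrative only."""
    return abs(bias_pct) + k * cv_pct

# An assay biased +5% with 7% imprecision:
print(total_error(5.0, 7.0))   # 19.0% total error
```

The point is not the particular constants but that bias and random scatter are folded into a single figure that can be compared against an acceptance limit.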

Despite the depth of consideration given to measurement error, there appears to be an element of lip service to its acknowledgment, since error quantifications are generally used only to determine acceptance or rejection of an assay's results, rather than to modify or qualify in any way the numerical result produced by the assay. Measurement error is thus consistently brushed under the carpet, with results often reported in ways that dramatically overstate their precision. As an end result, differences from one measure to another in some cases become infused with spurious meaning, when they actually represent nothing more than noise in the signals. One can only speculate on how many drugs that failed during late-phase trials might never have gotten that far had the measurement error in their "promising" preclinical results been accounted for in the data analysis.

Sometimes measurement error is compounded dramatically when more than one erroneous measurement enters the final calculation of a result, and again the norm is to ignore this inflated error rather than account for it in the results reported. Consider the example of a clinical trial examining the effect of a treatment on the activity of a disease biomarker in white blood cells. Two assays are required: one to quantify the disease biomarker in a lysate prepared from the cells, and one to quantify the amount of protein recovered in each cell preparation. The activity result from the first assay is then normalized against the protein result from the second, giving activity per mass of protein as the final assay result. Looking more closely at the potential error in this example, note that standard acceptance criteria for bioassays generally specify that a certain level of demonstrated measurement error will be considered acceptable, with data still included "as is" in the analysis. Assuming a typical acceptance criterion of up to ±20% inaccuracy, two measurements of a single identical sample of true value X could therefore quantify as differing from each other by 40% of X, and no one would raise an eyebrow. In fact, Example 1 shows that once the compound error of the two assays is accounted for, two final results differing by more than twofold are possible when the samples are actually replicates containing an identical amount of the biomarker.

Example 1. One assay for enzyme activity and one for protein content of a sample are performed, with the final activity reported per mass of protein. Assuming ±20% measurement inaccuracy in each of the two assays, if patient A has a true activity of 10 activity units, acceptably accurate activity assays could return values of 8 to 12 units, and each true milligram of protein could be quantified as 0.8 to 1.2 mg. Thus, within each assay alone, acceptable measurement error could make one result of an identical sample 1.5-fold the other (12 units/8 units, or 1.2 mg/0.8 mg).

Normalizing enzyme activity units to milligrams of protein inflates this possible discrepancy between two identical samples from 1.5-fold to 2.25-fold; that is, the most extreme acceptable values from each assay, normalized against each other, would be 8 units/1.2 mg and 12 units/0.8 mg, for a final resulting range of 6.7 to 15 units/mg from replicate aliquots of the same sample.
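The arithmetic in Example 1 is easy to verify. A short sketch using the example's numbers and its ±20% tolerance:

```python
def ratio_bounds(true_activity, true_protein, tol=0.20):
    """Extreme normalized results (activity per mass of protein) when
    both assays are allowed up to +/- tol fractional inaccuracy."""
    low = true_activity * (1 - tol) / (true_protein * (1 + tol))
    high = true_activity * (1 + tol) / (true_protein * (1 - tol))
    return low, high

# True values from Example 1: 10 activity units, 1 mg protein.
low, high = ratio_bounds(10.0, 1.0)
print(round(low, 1), round(high, 1))  # 6.7 to 15.0 units/mg
print(round(high / low, 2))           # replicates can differ 2.25-fold
```

Both extremes pass the ±20% acceptance criterion for each individual assay, yet the normalized results they produce differ by a factor of 2.25.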

This example demonstrates the compounding of only two measurement errors. Imagine the case when more measurements go into a calculation, each having its own error: for example, when using reporter gene induction in a cell-based system as a model of induction of an in vivo response. In this case there are four measurements that must be made to control the system properly, each with its own measurement error range: the reporter gene expression in transfected cells challenged with the experimental treatment, its expression in transfected cells challenged with only the vehicle in which the treatment compound is dissolved, and those same two conditions in cells transfected with empty vectors (as a control for the effects of the vector backbone on the system). All too often, researchers using models of this sort tend to "simplify" by expressing induction of reporters as simply a fold increase over the vehicle effect, essentially pretending that the complex and compounded error does not exist.
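The four-measurement case can be sketched in the same spirit, using hypothetical signal values and worst-case interval arithmetic with the same ±20% tolerance:

```python
def fold_induction(rep_treated, rep_vehicle, empty_treated, empty_vehicle):
    """Fold induction corrected for vector-backbone effects: each
    reporter signal is first normalized to the matching empty-vector
    signal, then treated is compared to vehicle."""
    return (rep_treated / empty_treated) / (rep_vehicle / empty_vehicle)

def fold_bounds(signals, tol=0.20):
    """Worst-case range of the corrected fold induction when each of
    the four measurements carries up to +/- tol fractional inaccuracy."""
    rt, rv, et, ev = signals
    low = fold_induction(rt * (1 - tol), rv * (1 + tol),
                         et * (1 + tol), ev * (1 - tol))
    high = fold_induction(rt * (1 + tol), rv * (1 - tol),
                          et * (1 - tol), ev * (1 + tol))
    return low, high

# Hypothetical true signals: a 3-fold induction, no backbone effect.
low, high = fold_bounds((300.0, 100.0, 50.0, 50.0))
print(round(low, 2), round(high, 2))  # 1.33 to 6.75
```

A true 3-fold induction could thus legitimately be reported as anything from 1.33-fold to 6.75-fold, a five-fold spread, once all four acceptable measurement errors line up unfavorably.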

Noncontinuous Data

Measurement error can also have an important effect on noncontinuous data, an effect which, as noted above, is distinct from random error and is rarely taken into account. In this case, misclassifications can occur, such as false negatives or positives. We can all imagine the personal impact of such misclassifications, such as being falsely diagnosed with a deadly disease, or conversely, not receiving appropriate treatment because of a false-negative test result. Misclassification errors also decrease the power of studies to detect significant differences and have the potential to obscure or even falsify study results, leading to equally devastating poststudy impact on a much wider scale.

For example, efforts to develop an effective vaccine for malaria, considered to be the deadliest pediatric disease, have been hampered by measurement errors leading to misclassifications. In this case, the fact that vaccine efficacy (VE) is a ratio of the number of infected, vaccinated persons to the number of infected, nonvaccinated persons gives rise to the misconception that false-positive or false-negative diagnoses in the numerator and denominator will cancel out or balance. Actually, since measurement error increases as the lower limit of detection is approached, a partial effect of vaccination that lowers signal levels at the time of measurement (such as a slowing of parasite growth, the biomarker for VE) can lead to systematic error in the measurement of the test group that is not matched in the control group, and thus to an overestimation of VE [12].
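The asymmetry can be illustrated with made-up numbers (not taken from the cited study): if a partial vaccine effect pushes some vaccinated-group infections below the detection limit while control-group infections all produce strong signals, measured VE overshoots true VE:

```python
def vaccine_efficacy(infected_vaccinated, infected_control):
    """VE as 1 minus the ratio of infections in the vaccinated group
    to infections in the control group (equal-sized groups assumed)."""
    return 1.0 - infected_vaccinated / infected_control

# Hypothetical true infections per 1000 subjects in each arm:
true_ve = vaccine_efficacy(60, 100)            # true VE = 0.40

# Suppose the partial vaccine effect slows parasite growth, so 20% of
# the vaccinated group's infections sit below the detection limit at
# measurement time and are misclassified as negative; the control
# group's stronger signals are all detected.
measured_ve = vaccine_efficacy(60 * 0.80, 100) # measured VE = 0.52
print(true_ve, measured_ve)
```

The misclassification occurs only in the numerator, so nothing cancels: a 40% efficacy is reported as 52%.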

The end effects of misclassification error of this sort can vary substantially, depending not only on the goal of the testing and the decisions that are dependent on it, but also on the overall frequency of the disease being studied. For instance, elevated false-positive rates for a test may have a huge impact in the case of rare diseases, and conversely, elevated false-negative rates may have a bigger impact with very common diseases. Example 2 shows that if a test for rare disease X (occurring in 1 in 1000 people) gives 99% accurate positive results in people who truly have the disease and 95% accurate negative results in people who truly do not have it, any positive test result it gives will be false in 98% of the cases.

Example 2. A diagnostic test has the following known accuracy: It returns a positive result in 99% of patients who truly have the disease and a negative result in 95% of patients who truly do not have the disease. The disease is rare, occurring in the general population at a rate of only 1/1000 people. Although you have no known predisposition or reason to think you have the disease, you decide to get tested and are shocked to learn that the result has come back positive. What are your chances that the diagnosis is actually a false positive?

True positive probability = probability of true positives / total of true and false positive probabilities
= (0.99 × 0.001) / (0.99 × 0.001 + 0.05 × 0.999) ≈ 0.019

False positive probability = 1 − true positive probability = 1 − 0.019 = 0.981

Your chances are about 98% that the positive diagnosis is false.
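The calculation in Example 2 is a direct application of Bayes' rule and is easy to reproduce; a 99.9%-specificity case is added for comparison:

```python
def false_positive_prob(prevalence, sensitivity, specificity):
    """Probability that a positive result is false, via Bayes' rule:
    false positives as a fraction of all positive results."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return false_pos / (true_pos + false_pos)

# Example 2: prevalence 1/1000, sensitivity 99%, specificity 95%.
print(round(false_positive_prob(0.001, 0.99, 0.95), 3))   # 0.981

# Even at 99.9% specificity, about half of all positives are false.
print(round(false_positive_prob(0.001, 0.99, 0.999), 2))  # 0.5
```

With a rare disease, the tiny pool of true positives is simply swamped by the false positives generated from the vast healthy majority.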

The test's accuracy for negative results would have to rise to 99.9% just to bring the false-positive rate among positive results down to about 50%. Whereas, intuitively, one might think of 95% and 99.9% accuracy as both sounding pretty good, you can see that the seemingly small gap between them can have a surprisingly important impact.