## Statistical Analysis And Power

Threaded through all the above issues are implicit considerations of statistical analysis and power. Every single decision about research design has some impact on the appropriate choice of analysis and on the power to detect clinically significant effects. In fact, the major difference between well-designed and well-executed present-day RCTs and those done 50 years ago stems from advances in methods of statistical analysis of results and better understanding of the concept and application of power in designing RCTs.

With the simplest possible design, randomly assigning a representative sample of patients to a treatment (T) or control (C) condition, with a binary primary outcome ("success" versus "failure"), the analytic method would be a 2 × 2 (treatment × outcome) χ² test. This is the least powerful design (Cohen 1983; Kraemer 1991; Kraemer and Thiemann 1987, 1989; MacCallum et al. 2002), and the study thus requires perhaps twice, perhaps 10 times, as many patients for adequate power compared with studies with other designs. A valid choice? Yes. A wise choice? No.

Suppose we merely substituted that binary primary outcome with a dimensional one (for example, symptom level at the end of treatment) and proposed to use the most common RCT analysis method: the two-sample t test. Immediately, there would be an increase of power (thus requiring a smaller sample size for adequate power). If the goal is to detect a moderate effect (number needed to treat [NNT] ≈ 4), for a 5% two-tailed test one would require 63 patients per group or a total of 126 patients. Detecting a small effect (NNT ≈ 9) would require 389 patients per group, and detecting a large effect (NNT ≈ 2) would require only 26 patients per group.
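The sample sizes quoted above can be roughly reproduced with the usual normal approximation for comparing two means. The sketch below rests on three assumptions that the text does not state: 80% power, small, moderate, and large effects corresponding to standardized mean differences (Cohen's d) of 0.2, 0.5, and 0.8, and the conversion NNT = 1/(2Φ(d/√2) − 1). The function names are hypothetical and chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate patients per group for a two-tailed two-sample comparison.

    Normal approximation: n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2.
    The 80% default power is an assumption, not a value from the text.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


def nnt_from_d(d):
    """Convert a standardized mean difference to NNT (assumed conversion)."""
    success_rate_difference = 2 * _STD_NORMAL.cdf(d / sqrt(2)) - 1
    return 1 / success_rate_difference


for label, d in [("small", 0.2), ("moderate", 0.5), ("large", 0.8)]:
    print(f"{label}: d={d}, NNT~{nnt_from_d(d):.1f}, n per group~{n_per_group(d)}")
```

Under these assumptions the approximation gives about 393, 63, and 25 patients per group for small, moderate, and large effects. These are close to, but not identical with, the figures in the text, which presumably come from exact calculations or published tables.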

The two-sample t test is valid when the outcomes being measured are approximately normally distributed with equal variances in the two groups, but many clinically meaningful outcomes have asymmetrical distributions, have long tails, or occur with unequal variances in the two groups. Then one might use instead the nonparametric Wilcoxon rank sum test (Mann-Whitney test). When the t test gives valid results, the Mann-Whitney test is also valid and has quite similar power. However, the nonparametric test is valid in many circumstances when the t test is not. This illustrates two general principles: the choice of the outcome measure must be in accordance with the choice of the analytic procedure, and better selection of the outcome measure has a major impact on the study's power and thus the necessary sample size.

If one uses any of the above for an RCT of depressed patients, most of the sample would be female; for an RCT of schizophrenic patients, most of the sample would be male. In many cases, it is proposed to stratify such sample populations in order to equalize the representation of males and females. Is this wise?

If it is decided that stratification is warranted, to be valid the analytic procedure must acknowledge that stratification. For a binary outcome, one might use a logistic regression analysis and for a dimensional outcome, a linear regression analysis, with treatment group, stratum (here, gender), and their interaction as independent variables in each case.

One common analytic error is to assume that the interaction does not exist and to use analysis of covariance with gender as the covariate. If that interaction does exist in the study population and is ignored in the analysis, it often compromises the significance level, and thus the validity of the test, and almost inevitably reduces the power. But if the interaction is included in the analysis, care must be taken to properly center all the independent variables to produce clinically interpretable results (Kraemer and Blasey 2004).

If this is done correctly, the interaction test assesses whether the treatment effect in women is different from the effect in men, and the main effect of treatment assesses whether the average treatment effect across men and women is nonrandom. When there is an interaction effect in the study population, the main effect of treatment in this analysis is not the same as the effect of treatment assessed in an unstratified sample. The crucial issue then is which treatment effect is of interest—the effect in the total population or the average effect across the subpopulations defined by the strata.

If the decision is that the sample should be stratified, the sample size needed for adequate power is likely to increase, and the logistical difficulty of accumulating a stratified sample is likely to be much greater. If, for example, 80% of those with the disorder of interest are women, but it is decided that 50% of the sample in the RCT should be women, one will have to work much harder to recruit that oversampling of men into the study. Thus, careful thought should be given to whether the rationale and justification for stratification are strong enough to necessitate larger samples, more complex analyses, and a shift in the hypothesis being tested.

The difficulty of such decisions is exacerbated when researchers (or reviewers) seek to control for multiple covariates (e.g., gender, age, ethnicity, initial severity of the disorder). To truly control the study for the effects of such variables, one stratifies the sample. However, with gender (two possibilities), age (say, five age groups), ethnicity (say, five ethnic groups), and initial severity (say, three levels), one has 2 × 5 × 5 × 3 = 150 strata, and one would have to recruit adequate numbers into each stratum (for optimal power, an equal number into each stratum). If even a minimal number of patients per stratum were specified, say, 10 per stratum (5 randomized to T and 5 to C), the minimal sample size would be 1,500!
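The arithmetic behind this example is easy to check and to rerun with other choices of stratification variables. The snippet below is only an illustration; the dictionary keys are hypothetical labels for the four variables named in the text.

```python
from math import prod

# Levels per stratification variable, taken from the example in the text.
levels = {"gender": 2, "age_group": 5, "ethnicity": 5, "initial_severity": 3}
patients_per_stratum = 10  # 5 randomized to T and 5 to C

n_strata = prod(levels.values())
minimum_sample_size = n_strata * patients_per_stratum

print(n_strata, minimum_sample_size)  # 150 1500
```

Adding one more three-level variable to `levels` would triple both numbers, which is why the number of strata quickly becomes unmanageable.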

If these stratification variables are not very strongly associated with treatment effect, the result is a study with less power than would be achieved with a simple design. If there are collinearities among these variables (say, women and older patients tend to have higher initial disease severity), the power to detect treatment effects might also be reduced. One of the least wise decisions in RCT design is to try to control for the effects of too many variables, and many experienced biostatisticians argue against any stratification of the sample unless the primary hypotheses concern moderators of treatment outcome.

Researchers and review committees, however, often propose another tactic: Instead of controlling for the effects of these baseline variables through stratification, adjust for them in a mathematical model. Now the sample would continue to be 80% women, but the analysis would include consideration of both treatment and gender. What then often happens in analysis is exclusion of all interactions. Without adjustment, the two-sample t test has N − 2 degrees of freedom (the larger the number of degrees of freedom, other things being equal, the greater the power), but with a single covariate and its interaction with treatment, that becomes N − 4 (N − 2²), and with four covariates and all their interactions, it becomes N − 32 (N − 2⁵). As noted above, if such interactions exist in the study population and are excluded in the model, the significance level may be compromised and power is almost inevitably lost. Thus, if covariates are to be included, their interactions must be as well. Unless inclusion of those variables has a major strengthening effect on effect size, this inevitably means a loss of power. Finally and perhaps most important, collinearity effects resulting from associations between the variables included cost even more power. Again, most experienced biostatisticians argue against adjusting for the effects of baseline variables in the absence of a strong rationale and empirical justification for doing so.
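The loss of degrees of freedom can be made concrete by listing the terms of the model. A model with treatment, k covariates, and every interaction among them has 2 to the power (k + 1) terms, counting the intercept. The sketch below assumes each covariate uses a single degree of freedom (binary or continuous) and uses a hypothetical total sample of 126 patients; the function names are illustrative only.

```python
from itertools import combinations


def full_factorial_terms(n_covariates):
    """List every term in a model with treatment, covariates, and all interactions."""
    factors = ["treatment"] + [f"covariate_{i + 1}" for i in range(n_covariates)]
    terms = ["intercept"]
    for order in range(1, len(factors) + 1):
        # Each combination of `order` factors is one main effect or interaction.
        terms += [" x ".join(combo) for combo in combinations(factors, order)]
    return terms


def residual_df(n_patients, n_covariates):
    """Residual degrees of freedom, assuming one df per covariate."""
    return n_patients - len(full_factorial_terms(n_covariates))


for k in (0, 1, 4):
    print(k, len(full_factorial_terms(k)), residual_df(126, k))
```

With 126 patients, the residual degrees of freedom fall from 124 with no covariates to 122 with one covariate and to 94 with four.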

On the other hand, in a multisite RCT, stratification by site is built into the design and must be included in the analysis, and even then, many researchers and reviewers choose to ignore it. Multisite RCTs often show that site differences are a major source of variance in the outcome measurements (MTA Cooperative Group 1999). The most convincing demonstration of the almost-ubiquitous nature of site differences is not from an RCT but from a study of inbred strains of mice in a genetics study (Crabbe et al. 1999) under controlled laboratory conditions. Even then, site differences occurred. In an RCT, if samples are drawn from different sites, or in different time spans, or at the same site at the same time but using different recruitment strategies (e.g., referrals from doctors versus responses to advertisement), one should always expect that these differences will affect the primary outcome. Thus, randomization must be done within each such stratum, and comparison of T versus C must be a pooled comparison of the within-stratum comparisons of T versus C (Kraemer and Robinson 2005).
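One way to picture a pooled comparison of within-stratum comparisons is to compute the T minus C difference separately at each site and then average those differences. The sketch below uses inverse-variance weights, which is one common choice and an assumption here, since the text does not specify a weighting scheme. The data and function names are hypothetical.

```python
from statistics import mean, variance


def within_site_difference(t_scores, c_scores):
    """Return (T - C mean difference, its estimated variance) for one site."""
    diff = mean(t_scores) - mean(c_scores)
    var = variance(t_scores) / len(t_scores) + variance(c_scores) / len(c_scores)
    return diff, var


def pooled_difference(sites):
    """Inverse-variance weighted average of the within-site T - C differences."""
    results = [within_site_difference(t, c) for t, c in sites]
    weights = [1 / var for _, var in results]
    pooled = sum(w * d for w, (d, _) in zip(weights, results)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    return pooled, pooled_se


# Hypothetical outcome scores: (T group, C group) at each of two sites.
# Site B scores run about 10 points higher overall than site A (a site effect).
sites = [
    ([10, 12, 14], [6, 8, 10]),     # site A: T - C = 4
    ([20, 22, 24], [18, 20, 22]),   # site B: T - C = 2
]
pooled, pooled_se = pooled_difference(sites)
print(pooled, pooled_se)
```

Because each comparison is made within a site, the overall difference in score levels between sites A and B does not contaminate the estimate of the treatment effect.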

Thus far we have focused on assessing a single primary outcome at the end of treatment, whether a binary success/failure or a dimensional outcome, and have recommended against using a binary outcome. But, some would argue, some outcomes are by their nature binary: either the patient dies or not, the patient recovers or not, the disease remits or not. Are we not then obliged to use a binary outcome, and "take the hit" by increasing the sample size manyfold?

Outcomes such as these occur over the course of time, and at different times for different patients. By simply reorienting the analysis to examination of the time to the event, one moves from a binary outcome to a dimensional one. This is the situation in which survival analyses become the analytic procedure of choice: Kaplan-Meier estimation of survival curves within each group (Kaplan and Meier 1958), comparison of these survival curves in the T versus C groups, and use of the Cox proportional hazards model (Andersen et al. 1985), for example, when there are strata or covariates to be considered. Although the sample sizes for adequate power will be somewhat greater than with other dimensional outcomes (because some patients will be censored, i.e., they will not have had the outcome occur before the end of study), the sample size here will be much smaller than when using a binary outcome, and more useful clinical information will be obtained.
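To show how censored patients still contribute information, here is a minimal sketch of the Kaplan-Meier product-limit estimate for a single group. It is an illustration with hypothetical data, not a substitute for a statistical package; the function name and the data are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time.

    times  : follow-up time for each patient
    events : 1 if the outcome occurred at that time, 0 if the patient was censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events > 0:
            # Multiply by the fraction of at-risk patients who remain event free.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        at_risk -= n_leaving
    return curve


# Hypothetical data: weeks to the event; the second patient was censored at week 2.
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
curve = kaplan_meier(times, events)
print(curve)  # [(1, 0.75), (3, 0.375), (4, 0.0)]
```

The censored patient counts in the risk set at week 1 and then drops out of the denominator, so the partial follow-up is used rather than discarded.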

Also, with dimensional outcome measurement, modern analytic tools can lead to increased power without increasing sample size. For example, instead of assessing the outcome using only the endpoint of treatment, one could assess that outcome measure at baseline and at fixed times during the treatment period. Random regression models (also known as hierarchical models, or growth curves; Berger 1986; deLeeuw and Kreft 1986; Ware 1985) basically model the trajectory of response within each patient and then test whether the trajectories of response in the T group are clinically preferable to those in the C group. Because multiple measures per patient are used to characterize each patient's response, reliability is increased, and thus power is increased. Moreover, in the case of missing data or dropouts, partial data on the trajectory per patient enable stronger imputation methods to facilitate intention-to-treat analyses. Quite aside from the multiple statistical advantages of designing studies with repeated measures of outcomes over time, such information is often clinically informative in guiding clinicians to recognize early those patients who are unlikely to ever respond to a given treatment.
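The idea of characterizing each patient by a trajectory can be sketched with a simple two-stage summary: fit a least-squares slope to each patient's repeated measures, then compare the mean slopes between groups. This is a deliberate simplification and not a fitted random regression model, which would be estimated with mixed-model software. The data and names below are hypothetical.

```python
from statistics import mean


def slope(times, scores):
    """Ordinary least-squares slope of one patient's scores over time."""
    t_bar, y_bar = mean(times), mean(scores)
    num = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, scores))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


# Hypothetical symptom scores: (assessment weeks, scores) for each patient.
treatment = [
    ([0, 2, 4, 6], [20, 17, 14, 11]),
    ([0, 2, 4], [22, 20, 18]),        # dropped out after week 4
]
control = [
    ([0, 2, 4, 6], [21, 20, 19, 18]),
    ([0, 2, 4, 6], [19, 19, 18, 18]),
]

t_slopes = [slope(t, y) for t, y in treatment]
c_slopes = [slope(t, y) for t, y in control]
slope_difference = mean(t_slopes) - mean(c_slopes)
print(t_slopes, c_slopes, slope_difference)
```

The patient who dropped out still contributes a slope based on three assessments, which illustrates how partial trajectories remain usable when only an endpoint measure would be missing.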

This discussion barely scratches the surface of analytic methods available, but illustrates two general principles:

1. For adequate power and to best inform clinical decision making, characterize the response of each individual patient as precisely and concisely as possible (using reliable measures, preferably dimensional, with repeated measures over time). That might sometimes complicate the analysis, but analytic methods are generally available to take advantage of such precision.

2. Design the study to answer the primary research question, not to answer all possible questions that might arise. Leave those to secondary or exploratory post hoc analyses. Do not stratify the study population unless the design requires multiple sites or recruitment sources or the primary research question is about the strata. Do not try to control or adjust for all possible influences on treatment effect; instead, focus on controlling those factors empirically shown to strongly influence treatment effect.
