One Sided Tests
Gerard E. Dallal, PhD
One common criticism of significance tests is that no null hypothesis is ever true. Two population means or proportions are always unequal as long as measurements have been carried out to enough decimal places. Why, then, should we bother testing whether the means are equal? The answer is contained in a comment by John Tukey regarding multiple comparisons: the null hypothesis says we are unsure of the direction of the difference. In keeping with Tukey's comment, tests of the null hypothesis that two population means or proportions are equal
are almost always two-sided (or two-tailed*). That is, the alternative hypothesis is

H1: μ1 ≠ μ2,

which says that the difference between means or proportions can be positive or negative.
Every so often, someone claims that a difference, if there is one, can be in only one direction. For example, an investigator might claim that a newly proposed treatment, N, must be at least as good as the standard treatment, S. It cannot be worse, especially when the "standard" is a placebo. One-sided tests have been proposed for such circumstances. Suppose small values are good, that is, the goal of the treatment is to produce small values of something like cholesterol, blood pressure, or weight. The null hypothesis of equal effectiveness is

H0: μN = μS.

The alternative hypothesis states that the difference can be in only one direction:

H1: μN < μS.
For example, an investigator might propose using a one-tailed test to test the efficacy of a cholesterol-lowering drug because the drug cannot raise cholesterol. With a one-tailed test, the hypothesis of no difference is rejected if and only if the subjects taking the drug have cholesterol levels significantly lower than those of controls. Outcomes in which subjects taking the drug have cholesterol levels higher than those of controls are treated as failing to show a difference, no matter how much higher they may be.
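To make the distinction concrete, here is a minimal sketch in Python (SciPy's ttest_ind supports both alternatives; the group sizes, means, and standard deviations are simulated and purely illustrative, not taken from any study mentioned here):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    drug = rng.normal(loc=195, scale=30, size=100)     # hypothetical treated group
    control = rng.normal(loc=205, scale=30, size=100)  # hypothetical control group

    # Two-sided test: H1 is mu_drug != mu_control.
    t_stat, p_two = stats.ttest_ind(drug, control, alternative="two-sided")

    # One-sided test: H1 is mu_drug < mu_control. A difference in the
    # other direction, however large, can never reject H0.
    _, p_one = stats.ttest_ind(drug, control, alternative="less")

    print(f"t = {t_stat:.2f}, two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")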
One-tailed tests make it easier to reject the null hypothesis when the alternative is true. A large-sample, two-sided, 0.05 level t test puts a probability of 0.025 in each tail. It rejects the null hypothesis of no difference in means only when the t statistic is less than -1.96 or greater than 1.96; when the expected direction is negative, this means t must be less than -1.96. A one-sided test puts all of the probability into a single tail. It rejects the hypothesis for values of t less than -1.645. Therefore, a one-sided test is more likely to reject the null hypothesis when the difference is in the expected direction. This makes one-sided tests very attractive to those whose definition of success is having a statistically significant result.
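The critical values quoted above are easy to verify with SciPy's standard normal quantile function (a sketch, assuming the large-sample normal approximation):

    from scipy.stats import norm

    alpha = 0.05
    print(norm.ppf(alpha / 2))  # -1.96: two-sided, 2.5% in each tail
    print(norm.ppf(alpha))      # -1.645: one-sided, all 5% in the lower tail
    # Any t between -1.96 and -1.645 rejects one-sided but not two-sided.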
What damns one-tailed tests in the eyes of most statisticians is the demand that all differences in the unexpected direction--large and small--be treated as simply nonsignificant. I have never seen a situation where researchers were willing to do this in practice. In practice, things can always get worse! Suppose subjects taking the new cholesterol-lowering drug ended up with levels 50 mg/dl higher than those of the control group. The use of a one-tailed test implies that the researchers would chalk it up to random variation and pursue it no further. However, we know they would immediately begin looking for an underlying cause and question why the drug was considered for human intervention trials.
A case in point is the Finnish Alpha-Tocopherol, Beta-Carotene Cancer Prevention Trial ("The Effect of Vitamin E and Beta-Carotene on the Incidence of Lung Cancer and Other Cancers in Male Smokers," N Engl J Med 1994;330:1029-35). There were 18% more lung cancers diagnosed and 8% more overall deaths in study participants taking beta carotene. If a one-sided analysis had been proposed for the trial, these results would have been ignored on the grounds that they were the result of unlikely random variability under a hypothesis of no difference between beta carotene and placebo. When the results of the trial were first reported, this was suggested as one of the many possible reasons for the anomalous outcome.

However, after these results were reported, investigators conducting the Beta Carotene and Retinol Efficacy Trial (CARET), a large study of the combination of beta carotene and vitamin A as preventive agents for lung cancer in high-risk men and women, terminated the intervention after an average of four years of treatment and told the 18,314 participants to stop taking their vitamins. Interim study results indicated that the supplements provided no benefit and might be causing harm: there were 28% more lung cancers diagnosed and 17% more deaths in participants taking beta carotene and vitamin A than in those taking placebos. Thus, the CARET study replicated the ATBC findings.
It is surprising to see one-sided tests still being used in the 21st century, even in a journal as renowned as the Journal of the American Medical Association. The study by Graat et al. ("Effect of Daily Vitamin E and Multivitamin-Mineral Supplementation on Acute Respiratory Tract Infections in Elderly Persons: A Randomized Controlled Trial," JAMA 2002;288:715-721) provides a perfect illustration of how one-sided tests can leave an investigator chagrined. The Statistical Analyses section (p 717) contains the comment, "Although the initial sample size was based on a 1-sided test on the assumption that effects would only be seen in 1 direction, after the study was completed the need for 2-sided tests became evident. P values are therefore based on 2-sided tests." One does have to admire the investigators for their honesty.
The usual 0.05 level two-tailed test puts half of the probability (2.5%) in each tail of the reference distribution, that is, the cutoff points for the t statistic are ±1.96. Some analysts have proposed two-sided tests with unequal tail areas. Instead of having 2.5% in each tail, there might be 4% in the expected direction and 1% in the other tail (for example, cutoffs of -1.75 and 2.33) as insurance against extreme results in the unexpected direction. However, there is no consensus or obvious choice for the way to divide the probability (e.g., 0.005/0.045, 0.01/0.04, 0.02/0.03), and some outcomes might give the false impression that the split was chosen after the fact to ensure statistical significance. This leads us back to the usual two-tailed test (0.025, 0.025).
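For the record, the cutoffs for these unequal splits follow directly from the normal quantiles; a short sketch of the splits listed above, taking the expected direction to be negative:

    from scipy.stats import norm

    # (lower-tail area, upper-tail area) pairs, expected direction in the lower tail
    for lower, upper in [(0.045, 0.005), (0.04, 0.01), (0.03, 0.02), (0.025, 0.025)]:
        print(f"{lower}/{upper}: reject if t < {norm.ppf(lower):.2f} "
              f"or t > {norm.ppf(1 - upper):.2f}")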
Marvin Zelen dismisses one-sided tests in another way--he finds them unethical! His argument is as simple as it is elegant. Put in terms of comparing a new treatment to a standard, anyone who insists on a one-tailed test is saying the new treatment cannot do worse than the standard. If the new treatment has any effect, it can only do better. However, if that's the case right at the start of the study, then it is unethical not to give the new treatment to everyone!
-------------
*Some statisticians find the word tails to be ambiguous and use sided instead. Tails refers to the distribution of the test statistic, and there can be many test statistics. While the most familiar test statistic might lead to a two-tailed test, other statistics might not. When the hypothesis H0: μ1 = μ2 is tested against the alternative of inequality, it is rejected for large positive values of t (which lie in the upper tail) and large negative values of t (which lie in the lower tail). However, this test can also be performed by using the square of the t or z statistic (t² = F(1,n); z² = χ²(1)). Then only large values of the test statistic will lead to rejecting the null hypothesis. Since only one tail of the reference distribution leads to rejection, it is a one-tailed test, even though the alternative hypothesis is two-sided.

Side refers to the hypothesis, namely, on which side of 0 the difference μ1 - μ2 lies (positive or negative). Since this is a statement about the hypothesis, it is independent of the choice of test statistic. Nevertheless, the terms two-tailed and two-sided are often used interchangeably.
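A quick numerical check of this point (a sketch; the t statistic and degrees of freedom are arbitrary): the two-tailed p-value from t equals the one-tailed p-value from t² referred to an F distribution with 1 and n degrees of freedom.

    from scipy.stats import t as t_dist, f as f_dist

    t_stat, df = -2.1, 40                            # arbitrary illustrative values
    p_two_tailed = 2 * t_dist.sf(abs(t_stat), df)    # both tails of t
    p_one_tailed = f_dist.sf(t_stat**2, 1, df)       # upper tail only of F(1, df)
    print(p_two_tailed, p_one_tailed)                # equal: same test, one tail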