
Some Aspects of Study Design
Gerard E. Dallal, Ph.D.

Introduction

100% of all disasters are failures of design, not analysis.
-- Ron Marks, Toronto, August 16, 1994
To propose that poor design can be corrected by subtle analysis
techniques is contrary to good scientific thinking.
-- Stuart Pocock (Controlled Clinical Trials, p 58) regarding the use of retrospective adjustment for trials with historical controls.
Issues of design always trump issues of analysis.
-- GE Dallal, 1999, explaining to a client why it would be wasted effort to focus on the analysis of data from a study whose design was fatally flawed.
Bias dominates variability.
-- John C. Bailler, III, Indianapolis, August 14, 2000
Statistics is not just a collection of computational techniques. It is a way of thinking about the world. Anyone can take a set of numbers and apply formulas to them. There are many computer programs that will do the calculations for you. But there is no point to analyzing data from a study that was not properly designed to answer the research question under investigation. In fact, there's a real point in refusing to analyze such data lest faulty results be responsible for implementing a program or policy contrary to what's really needed.

Two of the most valuable things a researcher can possess are knowledge of the principles of good study design and the courage to refuse to cut corners (to make a study more attractive to a funding agency or more convenient for the researcher, or to please one's employer, for example).

There are some general principles of study design that can be offered. However, the specifics can only be learned anecdotally. Every field of study has its peculiarities. Some things that are major issues in one field may never be encountered in another. Many of the illustrations in these notes are nutrition related because that's where I've done most of my work. If you ask around, others will be only too happy to share horror stories from their own fields.

The Basics of Study Design

(1) There must be a fully formed, clearly stated, focused research question and SINGLE primary outcome measure. You know you've done it right if it all comes down to a single outcome measurement on each observational unit (usually, an individual). There are problems if there are many numbers, either because there are many ways to ask the question or many ways to quantify a response. The problems are so severe that I despair over the way statistics is being widely misused today, when anyone can get a statistics program and be his/her own analyst. Often it doesn't take a full-fledged statistics program but only a spreadsheet program to carry out the mayhem.

The best way to get to the focused research question and primary outcome measure is to ask what single question the study must answer.

Focus is critical. There must be a clear goal in mind. Otherwise, time, energy, and money will be invested only to find that nothing has been accomplished. A useful approach is to ask, at the start of the design phase, if only one question could be answered by the project, what would that question be? (Other variations are, "If the results were summarized at the top of the evening news in a phrase or two spoken in a few seconds, what would the reporter say?" or "If the results were written up in the local newspaper, what would the headline be?") Not only does this help a statistician better understand an investigator's goals, but sometimes it forces the investigator to do some serious soul-searching.

It was stated in Is Statistics Hard? that classical statistics works by comparing study data to what is expected when there is nothing. If the data are not typical of what is seen when there is nothing, there must be something! Usually "not typical" means that some summary of the data is so extreme that it is seen less than 5% of the time when there is nothing. This is where trouble creeps in. Many researchers learn what happens less than 5% of the time in one situation and then apply it incorrectly to other situations. They think they are seeing something that's "not typical" when they are looking at something very typical! This is how disaster strikes.

The commonly used statistical methods were developed for the researcher with a carefully crafted research question and a clearly specified response measure. However, these methods work against the investigator when the question is vague or there are many ways to address it. When the research question is vague or there are many possible response measures, researchers invariably "look around" and perform many evaluations of the data. The chance that any single evaluation will hint at an effect is small. However, when many evaluations are performed, the same statistical methods now guarantee that something will suggest an effect when there is none if that something is treated as though it were the only evaluation that was performed. (On average, 5% of such evaluations will suggest that there is an effect.) Unfortunately, it is not uncommon for an investigator to perform many evaluations and focus only on the ones that support a difference while ignoring the many suggesting "no difference". But, the ones that seem to support a difference do so only when there aren't a bunch of other evaluations saying "no difference".
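A quick simulation makes this concrete. The sketch below (Python, with group sizes and the number of outcomes chosen only for illustration) draws two groups from the very same distribution, so there is truly nothing to find, yet the chance that at least one of ten outcomes looks "significant" is far greater than 5%.

    # Illustrative sketch: the false-positive rate when many outcomes are examined
    # and there is truly no effect. All numbers here are arbitrary choices.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_subjects, n_outcomes, n_studies = 30, 10, 2000
    studies_with_a_hit = 0

    for _ in range(n_studies):
        # Two groups drawn from the SAME distribution: there is nothing to find.
        group_a = rng.normal(size=(n_subjects, n_outcomes))
        group_b = rng.normal(size=(n_subjects, n_outcomes))
        p_values = stats.ttest_ind(group_a, group_b).pvalue   # one p-value per outcome
        if (p_values < 0.05).any():
            studies_with_a_hit += 1

    # Roughly 1 - 0.95**10, or about 40% of these "null" studies.
    print(f"studies with at least one 'significant' outcome: {studies_with_a_hit / n_studies:.2%}")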

This illustrates that a result that is too extreme to be typical behavior in one situation is typical behavior in another set of circumstances. The trick to being a competent analyst is to recognize when a result is typical and when it isn't.

Another (almost certainly apocryphal) example involves a student accused of cheating on a standardized test. Her scores had increased 200 points between two successive tests over a short period of time. The testing organization withheld the new grade, so the student took them to court to force them to release it. The hearing is supposed to have gone something like this:

Had the proctor suspected cheating, then it would have been quite a coincidence for that student to be the 1 out of 50,000 to have such a rise, but it is NOT surprising that it happened to someone, somewhere when 50,000 took a retest. Once again, a result that is too extreme to be typical behavior in one case is typical behavior in another set of circumstances. The chances of winning a lottery are small, yet there's always a winner.
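The lottery arithmetic is easy to check. Taking the figures above at face value for illustration--a rise this large occurring for about 1 examinee in 50,000, and 50,000 examinees retaking the test--the chance that it happens to someone is:

    # Small illustrative calculation using the 1-in-50,000 and 50,000 figures above.
    p_single = 1 / 50_000        # chance for one particular student
    n_retests = 50_000           # number of students retaking the test
    p_someone = 1 - (1 - p_single) ** n_retests
    print(f"{p_someone:.2f}")    # about 0.63 -- not surprising at all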

The same thing applies to research questions. Suppose an investigator has a theory about a specific causal agent for a disease. If the disease shows an association with the causal agent, her theory is supported. However, if the same association is found only after sifting through dozens of possible agents, the amount of support for that agent is greatly diminished. Once again, statistical theory says that if enough potential causal agents are examined, a certain proportion of those unrelated to the disease will seem to be associated with the disease if one applies the criterion appropriate for a fully-formed research question regarding a single specified causal agent.

To summarize:

Be skeptical of reported results that were not part of a study's original goals. Sometimes they are important, but often they are an attempt to justify a study that did not work out as hoped or intended. When many responses and subgroups are examined, statistical theory guarantees that some of them will appear to be statistically significant on the surface. These results should not be given the same status as the results from analyses directed toward the primary research question.

As a member of a Scientific Review Committee, I could not approve your proposal if it did not contain both a fully formed research question and a clearly stated outcome measure.

But what if, when all is said and done, you can't come up with a single outcome measure? Perhaps the field of research is too new or uncharted, so that electing a single outcome would be bad science. In that case, there are three options, not including

  0. Ignore the problem. Uhm...no!
  1. This is a bit like (0). 'Fess up. Admit that the area is uncharted and that what you are doing is a kind of pilot, exploratory study.
  2. Think harder! Maybe there is a natural single outcome. Maybe it's the sum of your outcomes, or their maximum, or some other transformation of them.
  3. Adopt a method of analysis that accounts for the multiple outcomes. One such method is the Bonferroni adjustment. Suppose the standard for "extreme" is the type of result that occurs only 5% of the time if there is no effect. If there are m reasonable outcome measures, the Bonferroni adjustment looks to see whether any exhibit the type of behavior that occurs only (5/m)% of the time if there is no effect. For example, with 5 outcome variables, an effect could be declared if any one of them gave the type of extreme result that occurred only 1% [=5%/5] of the time if there is no effect. (A small illustration follows this list.)
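For instance, option 3 might look like the sketch below, where the five p-values are invented purely to show the mechanics of the Bonferroni adjustment.

    # A minimal sketch of the Bonferroni adjustment described in option 3.
    # The p-values are made up for illustration; they come from no real study.
    alpha = 0.05
    p_values = [0.030, 0.008, 0.240, 0.012, 0.700]   # five outcome measures
    m = len(p_values)
    threshold = alpha / m                            # 0.05 / 5 = 0.01

    for i, p in enumerate(p_values, start=1):
        verdict = "declare an effect" if p < threshold else "not extreme enough"
        print(f"outcome {i}: p = {p:.3f} -> {verdict}")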

One type of problem I've been noticing lately is the single, composite primary response variable that combines outcome measures of such different degrees of importance that the composite makes little sense. It is typified by a report in the Wall Street Journal of Tuesday, September 2, 2008--"For Sick Heart Patients, Bypass Surgery Beats Stents"--regarding the SYNTAX Study:

After one year, 17.8% of stent patients had died, suffered a stroke or heart attack, or had to return for another operation. That figure compared with only 12.1% of bypass patients.

The big difference was a higher need for repeat procedures in patients who received stents; the one-year death rate for the two groups was essentially identical (7.7% for bypass patients, and 7.6% for stent patients.)

The Journal astutely points out that

It's important to point out that getting a stent is a less daunting procedure than bypass surgery, in which doctors crack open the chest.

Having had 3 angiograms myself, I know first hand that an angiogram is a walk in the park compared to a bypass. One has to wonder why the investigators would choose a measure that combined less serious outcomes with more serious outcomes, making it almost meaningless and resulting in adverse outcome rates of 17.8% and 12.1% where the procedure with the 17.8% rate may still be the preferable choice.

(2) The project must be feasible. This refers not only to resources (time and money), but also to whether there is agreement on the meaning of the research question and to whether everything that needs to be measured can be measured. What follows is a sampling of the sort of questions that must be addressed when judging a project's feasibility.

We may think we have a well-specified hypothesis or outcome variable, but do we?

If the study involves some condition, can we define it? Can we be sure we'll recognize it when we see it?

How accurate are the measurements? How accurate do they need to be? What causes them to be inaccurate?

How do we choose among different measurement techniques?

Are we measuring what we think we're measuring?

Can measurements be made consistently, that is, if a measurement is made twice will we get the same number? Can others get the same value (inter-laboratory, inter-technician variability)? What happens if different measurement techniques are used within a particular study (the x-ray tube breaks, the radioactive source degrades, supplies of a particular batch of reagent are exhausted)?

Sometimes merely measuring something changes it in unexpected ways.

Sometimes the answers to these questions say that a study should not be attempted. Other times the issues are found to be unimportant. Many of these questions can be answered only by a subject matter specialist. A common mistake is to go with one's instinct, if only to keep personnel costs down. However, it is essential to assemble a team with the appropriate skills if only to convince funding agencies that their money will be well spent. Even more important, the study will lack credibility if people with critical skills were not involved in its planning and execution.

A note about bias: Bias is an amount by which all measurements are deflected from their true value. For example, a particular technician might produce blood pressure readings that are consistently 5 mm higher than they should be. If this technician makes all of the measurements, then changes over time and differences between groups can be estimated without bias because the constant error cancels out. In similar fashion, food frequency questionnaires might underestimate total energy intake, but if they underestimate everyone in the same way (whatever that means!), comparisons between groups of subjects will still be valid.

If a technician or method is replaced during a study, some estimates become impossible while others are unaffected. Suppose one technician takes all of the baseline measurements and a second takes all of the followup measurements. If one technician produces biased measurements, we cannot produce valid estimates of a group's change over time. However, we can reliably compare the change over time between two groups, because again the bias cancels out. (If it were important to estimate the individual changes over time--that is, account for possible bias between technicians or measuring devices--the two technicians might be asked to analyze sets of split samples in order to estimate any bias that might be present.)
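The arithmetic behind "the bias cancels out" can be seen in a small simulation. The blood-pressure numbers, the 5 mm bias, and the two-technician scenario below are invented for illustration only.

    # Illustrative only: a constant measurement bias cancels out of comparisons made
    # with the same (biased) technician, but not out of a group's change over time
    # when baseline and follow-up are measured by different technicians.
    import numpy as np

    rng = np.random.default_rng(1)
    true_baseline = rng.normal(120, 10, size=100)    # true blood pressures
    true_change = 5.0                                # everyone truly rises 5 mm Hg
    true_followup = true_baseline + true_change

    bias_tech_1, bias_tech_2 = 5.0, 0.0              # technician 1 reads 5 mm high

    # Same technician at both visits: the bias cancels in the estimated change.
    change_same_tech = (true_followup + bias_tech_1) - (true_baseline + bias_tech_1)
    print(change_same_tech.mean())                   # about 5, the true change

    # Different technicians at the two visits: bias is confounded with the change.
    change_diff_tech = (true_followup + bias_tech_2) - (true_baseline + bias_tech_1)
    print(change_diff_tech.mean())                   # about 0 -- misleading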

(3) Every data item and every facet of the protocol must be carefully considered.

Be sure the cost in time and effort of each item is clearly understood. I suspect the urge to collect marginally related data comes from a fear that something might be overlooked ("Let's get it now, while we can!"), but another study can usually be conducted to tie up promising loose ends if they're really that promising. In general, if there is no solid analysis plan for a particular piece of data, it should not be collected. It may be unethical to collect data for which there is no analysis plan, depending on the burden it places on research subjects.

Treatments must be clearly identified.

Is the active ingredient what you think/hope it is or was the infusing instrument contaminated? Was there something in the water? In essence, anything that is done to subjects may be responsible for any observed effects, and something must be done to rule out those possibilities that aren't of interest (for example, socialization and good feelings or heightened sensitivity to the issue being studied that come from participation). Things aren't always what they appear to be. It is not unheard of for pills to be mislabeled, for purported placebos to contain substances that affect the primary outcome measurement (always assay every lot of every treatment), or for subjects to borrow from each other.

Sometimes convenience can make us lose sight of the bigger picture. Is there any point in studying the effects of a nutritional intervention or any health intervention in a population that starts out healthy? Is it ethical to study an unhealthy population, or must you treat them, instead? (There's a whole course in research ethics here!)

(4) Keep it simple!

With the advances in personal computing and statistical program packages, it often seems that no experiment or data set is too complicated to analyze. Sometimes researchers design experiments with dozens of treatment combinations. Other times they attempt to control for dozens of key variables. Sometimes attempts are made to do both-- study dozens of treatment combinations and adjust for scores of key variables! It's not that it can't be done. Statistical theory doesn't care how many treatments or adjustments are involved. The issue is a practical one. I rarely see studies with enough underlying knowledge or data to pull it off.

The aim of these complicated studies is a noble one--to maximize the use of resources--but it is usually misguided. I encourage researchers to study only two groups at once, if at all possible. When there are only two groups, the research question is sharply focused. When many factors are studied simultaneously, it's often difficult to sort things out, especially when the factors are chosen haphazardly. (Why should this treatment be less effective for left-handed blondes?) Just as it's important to learn to crawl before learning to walk, the joint behavior of multiple factors should be tackled only after gaining a sense of how they behave individually. Besides, once the basics are known, it's usually a lot easier to get funding to go after the fine details!

That's not to say that studies involving many factors should never be attempted. It may be critically important, for example, to learn whether the advantage of one treatment over another is different for men and women. However, there should be a strong, sound, overarching theoretical basis if a study of the joint behavior of multiple factors is among the first investigations proposed in a new area of study.

By way of example: Despite all of the advice you see today about the importance of calcium, back in the 1980s there was still some question about the reason older women had brittle bones. Many thought it was due to inadequate calcium intake, but others suggested that older women's bodies had lost the ability to use dietary calcium to maintain bone health. Studies up to that time had been contradictory. Some showed an effect of supplementation; others did not. Dr. Bess Dawson-Hughes and her colleagues decided to help settle the issue by keeping it simple. They looked only at women with intakes of less than half of the recommended daily allowance of calcium. The thought was that if calcium supplementation was of any benefit, this group would be most likely to show it. If calcium supplements didn't help these women, they probably wouldn't help anyone. They found a treatment effect and went on to study other determinants of bone health, such as vitamin D. However, they didn't try to do it all at once.

(5) Research has consequences!

Research is usually conducted with a view toward publication and dissemination. When results are reported, not only will they be of interest to other researchers, but it is likely that they will be noticed by the popular press, professionals who deal with the general public, and legislative bodies--in short, anyone who might use them to further his/her personal interests.

You must be aware of the possible consequences of your work. Public policy may be changed. Lines of inquiry may be pursued or abandoned. If a program evaluation is attempted without the ability to detect the type of effect the program is likely to produce, the program could become targeted for termination as a cost-savings measure when the study fails to detect an effect. If, for expediency, a treatment is evaluated in an inappropriate population, research on that treatment may improperly come to a halt or receive undeserved further funding when the results are reported.

One might seek comfort from the knowledge that the scientific method is based on replication. Faulty results will not replicate and they'll be found out. However, the first report in any area often receives special attention. If its results are incorrect because of faulty study design, many further studies will be required before the original study is adequately refuted. If the data are expensive to obtain, or if the original report satisfies a particular political agenda, replication may never take place.

The great enemy of the truth is very often not the lie--deliberate, contrived and dishonest--but the myth--persistent, persuasive and unrealistic. --John F. Kennedy

These basic aspects of study design are well-known, but often their importance is driven home only after first-hand experience with the consequences of ignoring them. Computer programmers say that you never learn from the programs that run, but only from the ones that fail. The Federal Aviation Administration studies aircraft "incidents" in minute detail to learn how to prevent their recurrence. Learn from others. One question you should always ask is how a study could have been improved, regardless of whether it was a success or a failure.

Three useful references that continue this discussion are

Types of Studies

Different professions label studies in different ways. Statisticians tend to think of studies as being of two types: observational studies and intervention trials. The distinction is whether or not the investigator has intervened to determine some of the conditions under which subjects will be studied. If there's an intervention--assigning subjects to different treatments, for example--it's an intervention trial (sometimes, depending on the setting, called a clinical trial). If there's no intervention-- that is, if subjects are merely observed--it's an observational study.

Observational Studies

Take a good epidemiology course! Statisticians tend to worry about issues surrounding observational studies in general terms. Epidemiologists deal with them systematically and have a name for everything! There are prospective studies, retrospective studies, cohort studies, nested case-control studies, among others.

Epidemiologists have also developed excellent terminology for describing what can go wrong with studies.

Statisticians tend to spell out issues in all their glory while epidemiologists capture them in a single phrase. I suspect that this is because epidemiologists spend more time studying bias while statisticians spend more time studying variability. Learn the terminology. Take the course.

Surveys

The survey is a kind of observational study because no intervention is involved. The goal of most surveys is to examine a sample of individuals in order to make statements about the population from which the sample was drawn. For example,

The analysis of survey data relies on samples being random samples from the population. The methods discussed in an introductory statistics course are appropriate when the sample is a simple random sample. The formal definition says a sample is a simple random sample if every possible sample has the same chance of being drawn. A less formal-sounding but equally rigorous definition says to draw a simple random sample, write the names of everyone in the population on separate slips of paper, mix them thoroughly in a big box, close your eyes, and draw slips from the box to determine whom to interview.

When isn't a random sample simple? Imagine having two boxes--one for the names of public high school students, the other for the names of private high school students. If we take separate random samples from each box, it is a stratified random sample, where the two strata are the two types of high schools. In order to use these samples to make a statement about all Boston area students, we'd have to take the total numbers of public and private school students into account, but that's a course in survey sampling and we won't pursue it here.
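For concreteness, here is a rough sketch of the two sampling schemes just described. The student rosters and sample sizes are placeholders, not real data; only the sampling logic matters.

    # Sketch of a simple random sample versus a stratified random sample.
    # The rosters below are hypothetical placeholders.
    import random

    random.seed(42)
    public_students = [f"public_{i}" for i in range(8000)]
    private_students = [f"private_{i}" for i in range(2000)]
    all_students = public_students + private_students

    # Simple random sample: every possible sample of 100 has the same chance.
    simple_sample = random.sample(all_students, 100)

    # Stratified random sample: separate draws from each "box" (stratum),
    # here proportional to the stratum sizes (80 public, 20 private).
    stratified_sample = (random.sample(public_students, 80)
                         + random.sample(private_students, 20))

    print(len(simple_sample), len(stratified_sample))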

Sometimes the pedigree of a sample is uncertain, yet statistical techniques for simple random samples are used regardless. The rationale behind such analyses is best expressed in a reworking of a quotation from Stephen Fienberg (Applied Statistics, 18(1969), 159), in which the phrases contingency table and multinomial have been replaced by survey and simple random:

"It is often true that data in a [survey] have not been produced by a [simple random] sampling procedure, and that the statistician is unable to determine the exact sampling scheme which was used. In such situations the best we can do, usually, is to assume a [simple random] situation and hope that it is not very unreasonable."
This does not mean that sampling issues can be ignored. It says that in some instances we may decide to treat data as though they came from a simple random sample as long as there's no evidence that such an approach is inappropriate.

Why is there such concern about the way the sample was obtained? With only slight exaggeration, if a sample isn't random (if its selection does not involve some probability device), statistics can't help you! We want samples from which we can generalize to the larger population. Some samples have obvious problems and won't generalize to the population of interest.

If we were interested in the strength of our favorite candidate in the local election we wouldn't solicit opinions outside her local headquarters. If we were interested in general trends in obesity, we wouldn't survey just health club members. But, why can't we just measure people who seem, well...reasonable?

We often think of statistical analysis as a way to estimate something about a large population. It certainly does this. However, the real value of statistical methods is their ability to describe the uncertainty in the estimates, that is, the extent to which samples can differ from the populations from which they are drawn. For example, suppose in random samples of female public and private high school students 10% more private school students have eating disorders. What does this say about all public and private female high school students? Could the difference be as high as 20%? Could it be 0.1%, with the observed difference being "just one of those things"? If the samples were drawn by using probability-based methods, statistics can begin to answer these questions. If the samples were drawn in a haphazard fashion or as a matter of convenience (the members of the high school classes of two acquaintances, for example, or the swim teams) statistics can't say much about the extent to which the sample and population values differ.
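To give a feel for what such an uncertainty statement looks like, here is a rough sketch of a 95% confidence interval for a difference between two proportions. The sample sizes and rates are invented for illustration; only the 10-percentage-point difference comes from the example above, and nothing here should be read as results from a real survey.

    # Illustrative only: how far might a sample difference be from the population
    # difference? A rough (normal-approximation) 95% confidence interval.
    from scipy.stats import norm

    n_private, n_public = 200, 200        # hypothetical sample sizes
    p_private, p_public = 0.25, 0.15      # hypothetical observed rates, 10 points apart

    se = (p_private * (1 - p_private) / n_private
          + p_public * (1 - p_public) / n_public) ** 0.5
    z = norm.ppf(0.975)
    diff = p_private - p_public
    print(f"95% CI: {diff - z*se:.3f} to {diff + z*se:.3f}")   # roughly 0.02 to 0.18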

The convenience sample--members of the population that are easily available to us--is the antithesis of the simple random sample. Statistics can't do much beyond providing descriptive summaries of the data because the probability models that relate samples to populations do not apply! Let me repeat that:

STANDARD PROBABILITY MODELS
DO NOT APPLY
TO CONVENIENCE SAMPLES!

It may still be possible to obtain useful information from such samples, but great care must be exercised when interpreting the results. One cannot simply apply standard statistical methods as though simple random samples had been used. You often see comments along the lines of "These results may be due to the particular type of patients seen in this setting." Just look at the letters section of any issue of The New England Journal of Medicine.

Epidemiologists worry less about random samples and more about the comparability of subjects with respect to an enrollment procedure. For example, suppose a group of college students was recruited for a study and classified as omnivores or vegans. There is no reason to expect any statement about these omnivores to apply to all omnivores or any statement about these vegans to apply to all vegans. However, if subjects were recruited in a way that would not cause omnivores and vegans to respond to the invitation differently, we might have some confidence in statistical analyses that compare the two groups, especially if differences between omnivores and vegans in this college setting were seen in other settings, such as working adults, retirees, athletes, and specific ethnic groups.

Cross-sectional studies vs longitudinal studies

Imagine a graph with many lines, one for each individual, tracing some measurement as it changes over time.

A cross-sectional study involves a group of different people observed at a single point in time (a slice or cross-section across the lines at a particular moment).

A longitudinal study involves the same individuals measured over time (or along the time line).

It is often tempting to interpret the results of a cross-sectional study as though they came from a longitudinal study. Cross-sectional studies are faster and cheaper than longitudinal studies, so there's little wonder that this approach is attractive. Sometimes it works; sometimes it doesn't. But, there's no way to know whether it will work simply by looking at the data.

When faced with a new situation, it may not be obvious whether cross-sectional data can be treated as though they were longitudinal. In cross-sectional studies of many different populations, those with higher vitamin C levels tend to have higher HDL-cholesterol (the so-called "good" cholesterol) levels. Is this an anti-oxidant effect, suggesting that an increase in vitamin C will raise HDL-cholesterol levels and that the data can be interpreted longitudinally, or are both measurements connected through a non-causal mechanism? Perhaps those who lead a generally healthy life style have both higher HDL-cholesterol and vitamin C levels. The only way to answer a longitudinal question is by collecting longitudinal data.

Even longitudinal studies must be interpreted with caution. Effects seen over the short term may not continue over the long term. This is the case with bone remodeling where gains in bone density over one year are lost over a second year, despite no obvious change in behavior.

Cohort Studies / Case-Control Studies

[This discussion is just the tip of the iceberg. To examine these issues in depth, find a good epidemiology course. Take it!]

In cohort studies, a well-defined group of subjects is followed. Two well-known examples of cohort studies are the Framingham Heart Study, which follows generations of residents of Framingham, Massachusetts, and the Nurses' Health Study, in which a national sample of nursing professionals is followed through yearly questionnaires. However, cohort studies need not be as large as these. Many cohort studies involve only a few hundred or even a few dozen individuals. Because the group is well-defined, it is easy to study associations within the group, such as between exposure and disease. However, cohort studies are not always an effective way to study associations, particularly when an outcome such as disease is rare or takes a long time to develop.

Case-control studies were born out of the best of intentions. However, they prove once again the maxim that the road to Hell is paved with good intentions. In a case-control study, the exposure status of a set of cases (those with some condition, such as a disease) is compared to the exposure status of a set of controls (those without the condition). For example, we might look at the smoking habits of those with and without lung cancer. Since we start out with a predetermined number of cases, the rarity of the disease is no longer an issue.

Case-control studies are fine in theory. However, they present a nearly insurmountable practical problem--the choice of controls. For example, suppose a study will involve cases of stomach cancer drawn from a hospital's gastrointestinal service. Should the controls be healthy individuals from the community served by the hospital, or should they be hospital patients without stomach cancer? What about using only patients of the hospital's GI service with complaints other than stomach cancer? There is no satisfactory answer because no matter what group is used, the cases and controls do not represent random samples from any identifiable population. While it might be tempting to "assume a [simple random] situation and hope that it is not very unreasonable" there are too many instances where series of case-control studies have failed to provide similar results. Because of this inability to identify a population from which the subjects were drawn, many epidemiologists and statisticians have declared the case-control study to be inherently flawed.

There is one type of case-control study that everyone finds acceptable--the nested case-control study. A nested case-control study is a case-control study that is nested (or embedded) within a cohort study. The cases are usually all of the cases in the cohort while the controls are selected at random from the non-cases. Since the cohort is well-defined, it is appropriate to compare the rates of exposure among the cases and controls.

It is natural to ask why all non-cases are not examined, which would allow the data to be analyzed as coming from a cohort study. The answer is "resources". Consider a large cohort of 10,000 people that contains 500 cases. If the data are already collected, a computer can just as easily analyze 10,000 subjects as 1,000, so the data should be analyzed as coming from a cohort study. However, the nested case-control study was developed for those situations where new data would have to be generated. Perhaps blood samples would have to be taken from storage and analyzed. If 500 controls would provide almost as much information as 9,500, it would be wasteful to analyze the additional 9,000. Not only would time and money be lost, but blood samples that could be used for other nested case-control studies would have been destroyed needlessly.
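A sketch of how the controls might be drawn for such a nested case-control study, using the hypothetical cohort of 10,000 people with 500 cases described above:

    # Illustrative only: keep all cases in the cohort and draw an equal number of
    # controls at random from the non-cases. The cohort here is simulated.
    import random

    random.seed(7)
    cohort = [{"id": i, "case": (i < 500)} for i in range(10_000)]   # 500 cases

    cases = [person for person in cohort if person["case"]]
    non_cases = [person for person in cohort if not person["case"]]

    controls = random.sample(non_cases, len(cases))   # 500 controls from 9,500 non-cases
    study_sample = cases + controls                   # only these 1,000 need new assays

    print(len(cases), len(controls), len(study_sample))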

Intervention trials/Controlled clinical trials

Generally speaking, there are two types of intervention trials--superiority trials and equivalence trials. The purposes of the two types of trials are what their names suggest.

The vast majority of trials are superiority trials. They are the only ones that will be discussed here. While there will be no formal discussion of equivalence trials, you can get a feel for some of the ways equivalence is established from the section on confidence intervals.

Randomized! Double-blind! Controlled!

When the results of an important intervention trial are reported in a highly-regarded, peer-reviewed journal, you will invariably see the trial described as randomized, double-blind, and (possibly placebo) controlled.

Suppose you have two treatments to compare. They might be two diets, two forms of exercise, two methods of pest control, or two ways to deliver prenatal care. How should you design your study to obtain a valid comparison of the two treatments? Common sense probably tells you, correctly, to compare groups of similar subjects that are handled identically except for the treatments under study.

How do we make our groups of subjects comparable? Who should get what treatment, or, as a trained researcher would put it, how should subjects be assigned to treatment? It would be dangerous to allow treatments to be assigned in a deliberate fashion, that is, by letting an investigator choose a subject's treatment. If the investigator were free to choose, any observed differences in outcomes might be due to the conscious or unconscious way treatments were assigned. Unscrupulous individuals might deliberately assign healthier subjects to a treatment in which they had a financial interest while giving the other treatment to subjects whom nothing could help. Scrupulous investigators, eager to see their theories proven, might make similar decisions unconsciously.

Randomized

Randomized means that subjects should be assigned to treatment at random, so that each subject's treatment is a matter of chance, like flipping a coin. If nothing else, this provides insurance against both conscious and unconscious bias. Not only does it ensure that the two groups will be similar with respect to factors that are known to affect the outcome, but also it balances the groups with respect to unanticipated or even unknown factors that might influence the outcome had purposeful assignments been used.

Sometimes randomization is unethical. For example, subjects cannot be randomized to a group that would undergo a potentially harmful experience. In such cases, the best we can often do is to conduct an observational study comparing groups that choose the behavior (such as smoking) with those who choose not to adopt the behavior, but these groups will often differ in other ways that may be related to health outcomes. When subjects cannot be randomized, studies are viewed with the same skepticism accorded to surveys based on nonrandom samples.

Resolution of the relation between lung cancer and cigarette smoking was achieved after a host of studies stretching over many years. For each study suggesting an adverse effect of smoking, it was possible to suggest some possible biases that were not controlled, thus casting doubt on the indicated effect. By 1964, so many different kinds of studies had been performed--some free of one type of bias--some of another--that a consensus was reached that heavy cigarette smoking elevated the risk of lung cancer. (Ultimately, the effect was recognized to be about a 10-fold increase in risk.) The story shows that induction in the absence of an applicable probability model is possible, but that induction in those circumstances can be difficult and slow.
According to The Tobacco Institute, the question of second hand smoke is still far from settled. Compare the long struggle over lung cancer and smoking to the one summer it took to establish the efficacy of the Salk polio vaccine.

On the other hand, it may not be unreasonable to randomize subjects away from potentially unhealthful behavior. If coffee drinking were thought to have a negative impact on some measure of health status, it would be unethical for a study to have coffee consumed by those who did not normally drink it. However, it might be ethical to ask coffee drinkers to give it up for the duration of the study. (The "might" here refers not so much to this hypothetical study but to others that might seem similar on the surface. For example, to study the cholesterol lowering property of some substance, it seems reasonable to work with subjects who already have elevated cholesterol levels. However, ethical behavior dictates that once subjects are identified as having elevated levels they should be treated according to standard medical practice and not studied!)

Randomization sounds more mysterious than it really is in practice. It can be as simple as assigning treatment by flipping a coin. You can generate a randomization plan automatically at randomization.com. Specify the names of the treatments and the number of subjects and the script will produce a randomized list of treatments. As subjects are enrolled into the study, they are given the next treatment on the list.
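The following is not the randomization.com script, just a homemade sketch of the same idea: build a balanced list of treatment labels, shuffle it, and hand out treatments in order as subjects enroll. The treatment names and group size are arbitrary.

    # Minimal sketch of a balanced randomization list (illustrative only).
    import random

    random.seed(2024)
    treatments = ["A", "B"]
    n_per_treatment = 10

    plan = treatments * n_per_treatment      # 10 A's and 10 B's
    random.shuffle(plan)                     # random order, balanced overall

    for subject_number, treatment in enumerate(plan, start=1):
        print(f"subject {subject_number:2d}: treatment {treatment}")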

Double-blind

Blinded means blind with respect to treatment: the subjects do not know which treatment they are receiving. Double-blind means that those who administer the treatments and evaluate the outcomes do not know, either.

It's easy to come up with reasonable-sounding arguments for not enforcing blinding ("I won't be influenced by the knowledge. Not me!" "My contact is so minimal it can't matter."), but EVERY ONE IS SPECIOUS. The following example illustrates how fragile things are:

Patients (unilateral extraction of an upper and lower third molar) were told that they might receive a placebo (saline), a narcotic analgesic (fentanyl), or a narcotic antagonist (naloxone) and that these medications might increase the pain, decrease it, or have no effect. The clinicians knew that one group (PN) of patients would receive only placebo or naloxone and not fentanyl and that the second group (PNF) would receive fentanyl, placebo, or naloxone. All drugs were administered double blind. Pain after placebo administration in group PNF was significantly less than after placebo in group PN! The two placebo groups differed only in the clinicians' knowledge of the range of possible double blind treatments. (Gracely et al., Lancet, 1/5/85, p 43)
When a new drug/technique is introduced, it is almost always the case that treatment effects diminish as the studies go from unblinded to blind to double-blind.

That is, the apparent treatment effect tends to shrink as the blinding becomes more rigorous.

Sometimes blinding is impossible. Women treated for breast cancer knew whether they had a lumpectomy, simple mastectomy, or radical mastectomy; subjects know whether they are performing stretching exercises or strength training exercises; there is no placebo control that is indistinguishable from cranberry juice. (Recently, we were looking for a control for black tea. It had to contain everything in black tea except for a particular class of chemicals. Our dieticians came up with something that tastes like tea, but it is water soluble and doesn't leave any leaves behind.) In each case, we must do the best we can to make treatments as close as possible, remembering that the differences we observe reflect any and all differences in the two treatments, not just the ones we think are important. As already noted, even if subjects can't be blinded, it may still be possible to maintain the blind for those evaluating outcomes.

No matter how hard it might seem to achieve blinding in practice, barriers usually turn out to be nothing more than matters of inconvenience. There are invariably ways to work around them. Often, it is as simple as having a colleague or research assistant randomize treatments, prepare and analyze samples, and make measurements.

Evaluating A Single Treatment

Often, intervention trials are used to evaluate a single treatment. One might think that the way to conduct such studies is to apply the treatment to a group of subjects and see whether there is any change. Unfortunately, many investigators do just that. (Whenever you see the phrase, "subjects as their own controls", be afraid, be very afraid!) However, that approach makes it difficult, if not impossible, to draw conclusions about a treatment's effectiveness.

Things change even when no specific intervention occurs. For example, cholesterol levels probably peak in the winter around Thanksgiving, Christmas, and New Year's when people are eating heavier meals and are lower in the summer when fresh fruits and vegetables are in plentiful supply. In the Northeast US, women see a decline in bone density during the winter and increase during the summer because of swings in vitamin D production from sunlight exposure. (Imagine an effective treatment studied over the winter months and called ineffective because there was no change in bone density! Or, an ineffective treatment studied over the summer and called effective because there was a change!)

When a treatment is described as effective, the question to keep in mind is, "Compared to what?" In order to convince a skeptical world that a certain treatment has produced a particular effect, it must be compared to a regimen that differs only in the facet of the treatment suspected of producing the effect.

Placebo controlled means that the study involves two treatments--the treatment under investigation and an ineffective control (placebo) to which the new treatment can be compared. At first, such a control group seems like a waste of resources--as one investigator describes them, a group of subjects that is "doing nothing". However, a treatment is not just the taking of some substance or following a particular exercise program, but all of the ways in which it differs from "doing nothing". That includes contact with health care professionals, heightened awareness of the problem, and any changes they might produce. In order to be sure that measured effects are the result of a particular facet of a regimen, there must be a control group whose experience differs from the treatment group's by that facet only. If the two groups have different outcomes, then there is strong evidence that it is due to the single facet by which the two groups differ.

One study that was saved by having a reliable group of controls is the Multiple Risk Factor Intervention Trial (MRFIT), in which participants were assigned at random to be counseled about minimizing coronary risk factors, or not. Those who were counseled had their risk of heart attack drop. However, the same thing happened in the control group! The trial took place at a time when the entire country was becoming sensitive to and educated about the benefits of exercise, a low fat diet, and not smoking. The intervention added nothing to what was already going on in the population. Had there been no control group--that is, had historical controls been used--it is likely that the intervention would have been declared to be of great benefit. Millions of dollars and other resources would have been diverted to an ineffective program and wasted.

What about comparing a treatment group to a convenience sample--a sample chosen not through a formal sampling procedure, but because it is convenient? Perhaps it would allow us to avoid having to enroll, randomize, and monitor subjects who are "doing nothing". The Salk vaccine trials have something to say about that.

Group          Cases per 100,000
Placebo                      71
Inoculated                   28
Refused                      46

When the trial was proposed, it was suggested that everyone whose parents agreed to participate should be inoculated while all others should be used as the control group. Fortunately, cooler heads prevailed and placebo controls were included. It turned out that those parents who refused were more likely to have lower incomes and only one child. The choice to participate was clearly related to susceptibility to polio. (Higher income went along with better hygiene, and with a contagious disease like polio, better hygiene meant less early exposure and therefore less natural immunity.) Imagine what the results would have been if there were no control group and the previous year's rates were used for comparison...and this turned out to be a particularly virulent year.

Even "internal controls" can fool you.

Studies with clofibrate showed that subjects who took 80% or more of their drug had substantially lower mortality than subjects who took less; this would seem to indicate that the drug was beneficial. But the same difference in mortality was observed between subjects with high and subjects with low compliance whose medication was the placebo. Drug compliance, a matter depending on personal choice, was for some reason related to mortality in the patients in this study. Were it not for the control group, the confounding between the quantity of drug actually taken (a personal choice) and other factors related to survival might have gone unnoticed, and the theory "more drug, lower mortality: therefore, the drug is beneficial" might have stood--falsely. (Moses, 1985, p.893)
It is important to check the treatment and placebo to make certain that they are what they claim to be. Errors in manufacturing are not unheard of. I've been involved in studies where the placebo contained a small amount of the active treatment. I've also been involved in a study where the packaging facility reversed the treatment labels so that, prior to having their labels removed before they were given to subjects, the active treatment was identified as placebo and placebo as active treatment! Be knowledgeable about the consequences of departures from a study protocol. An unreported communion wafer might not seem like a problem in a diet study--unless it's a study about gluten-free diets.

Placebo controls are unnecessary when comparing two active treatments. I have seen investigators include placebo controls in such studies. When pressed, they sometimes claim that they do so in order to monitor temporal changes, but I try to dissuade them if that is the only purpose.

Subjects As Their Own Controls

Despite the almost universally recognized importance of a control group, it is not uncommon to see attempts to drop it from a study in the name of cost or convenience. A telltale phrase that should put you on alert is that "Subjects were used as their own controls."

Subjects can and should be used as their own controls if all treatments can be administered simultaneously (e.g., creams A and B randomly assigned to the right and left arms). But in common parlance, "using subjects as their own controls" refers to the practice of measuring a subject, administering a treatment, measuring the subject again, and calling the difference in measurements the treatment effect. This can have disastrous results. Any changes due to the treatment are confounded with changes that would have occurred over time had no intervention taken place. We may observe what looks like a striking treatment effect, but how do we know that a control group would not have responded the same way?

MRFIT is a classic example where the controls showed the same "effect" as the treatment group. Individuals who were counseled in ways to minimize the risk of heart attack did no better than a control group who received no such counseling simply because the population as a whole was becoming more aware of the benefits of exercise. When subjects are enrolled in a study, they often increase their awareness of the topic being studied and choose behaviors that they might not have considered otherwise. Dietary intake and biochemical measures that are affected by diet often change with season. It would not be surprising to see a group of subjects have their cholesterol levels decrease after listening to classical music for six months--if the study started in January after a holiday season of relatively fatty meals and ended in July after the greater availability of fresh produce added more fruits and salads to the diet.

The best one can do with data from such studies is argue that the observed change was too great for coincidence. ("While there was no formal control group, it is biologically implausible that a group of subjects such as these would see their total cholesterol levels drop an average of 50 mg/dl in only 4 weeks. It would be too unlikely a coincidence for a drop this large to be the result of some uncontrolled factor.") The judgment that other potential explanations are inconsequential or insubstantial is one that must be made by researchers and their audience. It can't be made by statistical theory. The justification must come from outside the data. It could prove embarrassing, for example, if it turned out that something happened to the measuring instrument to cause the drop. (A control group would have revealed this immediately.)

In the search for ways to minimize the cost of research and to mitigate temporal effects, some studies adopt a three-phase protocol: measurement before treatment, measurement after treatment, and measurement after treatment ceases and a suitable washout period has expired, after which time subjects should have returned to baseline. In theory, if a jump and a return to baseline were observed, it would require the most remarkable of coincidences for the jump to be due to some outside factor. There are many reasons to question the validity of this approach.

Despite this indictment against this use of subjects as their own controls, cost and convenience continue to make it tempting. In order to begin appreciating the danger of this practice, you should look at the behavior of the control group whenever you read about placebo-controlled trials. You will be amazed by the kinds of effects they exhibit.

The Ethics of Randomized Trials

When a trial involves a health outcome, investigators should be truly indifferent to the treatments under investigation (equipoise). That is, if investigators were free to prescribe treatments to subjects, they would be willing to choose by flipping a coin. In the case of a placebo controlled trial, investigators must be sufficiently unsure of whether the "active" treatment is truly effective.

This may seem like a strange requirement for superiority trials, where the goal is to establish that a new treatment is better than standard care. However, if an investigator were so convinced of the superiority of the new treatment, how could s/he ethically allow a patient to be randomized to something else? Hope that a new treatment is superior is different from the knowledge that a new treatment is superior. The path of research is littered with treatments that were supposed to be effective but proved otherwise, even harmful in some cases.

Ethical considerations forbid the use of a placebo control if it would withhold standard treatment known to be effective. In such cases, the control must correspond to standard medical practice and the research question should be rephrased to ask how the new treatment compares to standard practice. For example, in evaluating treatments for high cholesterol levels, it would be unethical to do nothing for subjects known to have high levels. Instead, they would receive the treatment dictated by standard medical practice, such as dietary consultation and a recommendation to follow the AHA Step 1 diet or even treatment with cholesterol lowering drugs.

It is impossible, in a few short paragraphs, to summarize or even list the ethical issues surrounding controlled clinical trials. Two excellent references are Levine RJ (1986), Ethics and Regulation of Clinical Research, 2nd ed., New Haven: Yale University Press, and Dunn CM & Chadwick G (1999), Protecting Study Volunteers in Research, Boston: CenterWatch, Inc.

It is difficult to pass up the opportunity to comment on the dental study cited earlier from the Lancet. It would have been fascinating to have listened to the Institutional Review Board's discussion about whether to give its approval. There are dental patients who are being deceived. The true subjects of the study, the investigators who knew the potential range of each subject's treatments, were never given the chance to consent to be part of the study. Yet, it would be impossible to do the research without the deceptions. Since it speaks to the validity of the randomized trial, there is important knowledge to be gained.

I would be inclined to approve the study, subject to the usual requirements of debriefing subjects fully at the end. However, since the goal is to compare the placebo groups, I would require that the investigator explain why it is necessary to actually give some subjects a narcotic antagonist. Since subjects and investigators are already being deceived, would it invalidate the research to tell some subjects that they might receive a narcotic antagonist, but not actually give it to anyone?

An online short course on the protection of human subjects can be found at http://cme.cancer.gov/clinicaltrials/learning/humanparticipant-protections.asp. It's interesting, well-designed, and includes many informative links worth bookmarking. It even offers a certificate for completion.

Sample Size Calculations

A properly designed study will include a justification for the number of experimental units (subjects/animals) being examined. No one would propose using only one or two subjects per drug to compare two drugs, because it's unlikely that enough information could be obtained from such a small sample. On the other hand, applying each treatment to millions of subjects is impractical, unnecessary, and unethical. Sample size calculations are necessary to design experiments that are large enough to produce useful information and small enough to be practical. When health outcomes are being studied, experiments larger than necessary are unethical because some subjects will be given an inferior treatment unnecessarily.
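For a sense of what such a justification involves, here is a sketch of the standard normal-approximation calculation for comparing two means. The detectable difference, standard deviation, and power below are assumptions chosen purely for illustration, not values from any study discussed in these notes.

    # Illustrative only: the usual normal-approximation sample-size formula for
    # comparing two group means. All inputs are assumed values for the example.
    from math import ceil
    from scipy.stats import norm

    alpha, power = 0.05, 0.80      # two-sided 5% test, 80% power
    delta = 5.0                    # smallest difference worth detecting
    sigma = 10.0                   # assumed standard deviation of the response

    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n_per_group = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

    print(ceil(n_per_group))       # about 63 subjects per group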

Some Miscellaneous Topics

One vs Many

Many measurements on one subject are not the same thing as one measurement on many subjects. With many measurements on one subject, you get to know the one subject quite well but you learn nothing about how the response varies across subjects. With one measurement on many subjects, you learn less about each individual, but you get a good sense of how the response varies across subjects. A common mistake is to treat many measurements on one subject as though they were single measurements from different subjects. Valid estimates of treatment effects can sometimes be obtained this way, but the uncertainty in these estimates is greatly underestimated. This could lead investigators to think they have found an effect when the evidence is, in fact, insufficient.
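A small simulation (with invented numbers) shows how badly the uncertainty can be understated when within-subject measurements are treated as though they came from different subjects.

    # Illustrative only: many measurements on one subject are not the same as one
    # measurement on many subjects. Treating repeated measurements as independent
    # understates the uncertainty of the group mean.
    import numpy as np

    rng = np.random.default_rng(3)
    n_subjects, n_repeats = 10, 20
    subject_means = rng.normal(100, 10, size=n_subjects)           # varies across people
    data = subject_means[:, None] + rng.normal(0, 2, size=(n_subjects, n_repeats))

    # Appropriate: one summary value per subject, then the standard error across subjects.
    per_subject = data.mean(axis=1)
    se_correct = per_subject.std(ddof=1) / np.sqrt(n_subjects)

    # Naive: pretend all 200 numbers are independent observations.
    se_naive = data.std(ddof=1) / np.sqrt(data.size)

    print(f"subject-level SE ~ {se_correct:.2f}, naive SE ~ {se_naive:.2f}")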

The same ideas apply to community intervention studies, also called group-randomized trials. Here, entire villages, for example, are assigned to the same treatment. When the data are analyzed rigorously, the sample size is the number of villages, not the number of individuals. This is discussed further under units of analysis.

Paired vs Unpaired Data

Data are paired when two or more measurements are made on the same observational unit. The observational unit is usually a single subject who is measured under two treatment conditions. However, data from units such as couples (husband and wife), twins, and mother-daughter pairs are considered to be paired, too. They differ from unpaired (or, more properly, independent) samples, where only one type of measurement is made on each unit. They require special handling because the accuracy of estimates based on paired data generally differs from the accuracy of estimates based on the same number of unpaired measurements.
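The sketch below (simulated data, arbitrary means and spreads) analyzes the same numbers with and without respect for the pairing; the two analyses generally give different measures of uncertainty, which is why the pairing must be honored.

    # Illustrative only: paired data analyzed with and without the pairing.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    baseline = rng.normal(100, 15, size=20)              # one measurement per subject
    follow_up = baseline + rng.normal(3, 5, size=20)     # second measurement, same subjects

    paired = stats.ttest_rel(baseline, follow_up)        # respects the pairing
    unpaired = stats.ttest_ind(baseline, follow_up)      # wrongly ignores it

    print(f"paired p = {paired.pvalue:.4f}, unpaired p = {unpaired.pvalue:.4f}")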

Parallel Groups vs Cross-Over Studies

In a parallel groups study, subjects are divided into as many groups as there are treatments. Each subject receives one treatment. The treatments are studied in parallel.

In a cross-over study, all subjects end up receiving all treatments. When there are two treatments, half of the subjects are given A followed by B; the other half are given B followed by A. That is, midway through the study, subjects cross over to the other treatment.

Cross-over studies are about as close as you can come to the savings investigators would like to realize by using subjects as their own controls, but they contain two major drawbacks. The first problem is the possibility of a carryover effect: B after A may behave differently from B alone. The second is the problem of missing data; subjects who complete only one of the two treatments complicate the analysis. It's for good reason, then, that the US Food & Drug Administration looks askance at almost anything other than a parallel groups analysis.

Repeated Measures Designs

In repeated measures designs, many measurements are made on the same individual. Repeated measures can be thought of as a generalization of paired data to allow for more than two measurements. The analysis of paired data will be identical to an analysis of repeated measures with two measurements. Some statisticians maintain a distinction between serial measurements and repeated measures. According to the strict definitions, serial measurements are the same measurement made on a subject repeatedly over time, while repeated measures are different measurements, or the same measurement made under different conditions, taken on the same subject.

For example, the term serial measurements would be used when a subject's blood pressure is measured in the same way over time. Repeated measures would be used to describe a study in which subjects' blood pressure was measured many different ways (sitting, lying, manually, automated cuff) at once. I, myself, do not object to the use of the term repeated measures in conjunction with serial measurements.

Often the analysis of serial measurements can be greatly simplified by reducing each set of measurements to a single number (such as a regression coefficient, peak value, time to peak, or area under the curve) and then using standard techniques for single measurements.
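As a sketch of this summary-measure approach, with invented measurement times and responses:

    # Illustrative only: reduce each subject's series of measurements to a single
    # number (here, area under the curve by the trapezoidal rule, or the peak),
    # then analyze those single numbers with standard techniques.
    import numpy as np

    times = np.array([0, 1, 2, 4, 8])                    # hours after treatment
    # Hypothetical serial measurements, one row per subject.
    responses = np.array([[10, 18, 25, 20, 12],
                          [11, 15, 22, 19, 13],
                          [ 9, 20, 28, 22, 11]])

    auc_per_subject = np.trapz(responses, times, axis=1) # one summary per subject
    peak_per_subject = responses.max(axis=1)             # another possible summary

    print(auc_per_subject)   # these single numbers feed into standard analyses
    print(peak_per_subject)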

Intention-To-Treat & Meta Analysis

Among the topics that should be included in these notes are the highly controversial Intention-To-Treat Analysis (ITT) and Meta Analysis. Like many statisticians, I have strong feelings about them. Because these are highly charged issues, I have placed them in their own Web pages to give them some distance from the more generally accepted principles presented in this note.

Intention-To-Treat Analysis
Meta Analysis

The Bottom Line

We all want to do research that produces valid results, is worthy of publication, and meets with the approval of our peers. This begins with a carefully crafted research question and an appropriate study design. Sometimes all of the criteria for a perfect study are not met, but this does not necessarily mean that the work is without merit. What is critical is that the design be described in sufficient detail that it can be properly evaluated. (The connection between Reye's syndrome and aspirin was established in a case-control pilot study that was meant to try out the machinery before embarking on the real study. An observed odds ratio of 25 led the researchers to publish the results of the pilot.) Any study that is deficient in its design will rarely be able to settle the question that prompted the research, but it may be able to provide valuable information nonetheless.




Copyright © 1998 Gerard E. Dallal