By Robert P. Murphy
In recent years,
researchers in the social sciences have realized that they face a
"replication crisis." As major results in experimental
psychology fall apart under further scrutiny, economists might have taken
comfort in the relative rigor of their own field. However, economists, too,
have often been careless in their research design and have been overconfident
in the strength of their published results.
In a recent
episode of EconTalk, Russ Roberts interviewed Stanford University's John Ioannidis
to discuss his 2017 publication "The Power of Bias in Economics
Research." Ioannidis and his co-authors found that the vast majority
of estimates of economic parameters were from studies that were
"underpowered," and this, in turn, meant that the published estimates
of the magnitude of the effects were often biased upward.
Unfortunately,
many economists (including me) have little training in the concept of
"statistical power" and might be unable to grasp the significance of
Ioannidis' discussion. In this article, I give a primer on statistical power
and bias that will help the reader appreciate Ioannidis et al.'s shocking
results: After reviewing meta-analyses of more than 6,700 empirical studies,
they concluded that most studies, by their very design, would often fail to
detect the economic relationship under study. Perhaps worse, these
"underpowered" studies also provided estimates of the economic
parameters that were highly inflated, typically by 100%, and in one-third of the
cases by 300% or more.
Economists should
familiarize themselves with the concept of statistical power to better
appreciate the possible pitfalls of existing empirical work and to produce
more-accurate research in the future.
A Primer on Power and Bias: Researchers Flipping Coins
Suppose
that researchers are trying to determine whether the coins produced by a particular
factory are "fair," in the sense that they turn up Heads or Tails
with a 50/50 probability. To that end, a researcher performs an experiment by
flipping a coin a certain number of times and recording the sequence of Heads
and Tails.
In this
experimental setup, the "null hypothesis" is that the coin is fair.
Thus, in order to reject the null and conclude that the coin is not fair,
our researcher will need to see an outcome that is heavily lopsided towards either
Heads or Tails.
Our
researcher wants to protect himself from committing a "Type I error,"
in which he would erroneously reject the null hypothesis. If
the researcher committed a Type I error, his experiment would be giving him a
"false positive." That is, the researcher would announce to the world
that the coin is not fair, even though it actually is fair.
For
example, suppose that the researcher flips the coin only twice, and it comes up
"Tails, Tails." Prima facie, this sequence suggests that the coin is
unfair—that it is biased towards Tails. However, even if the coin were
perfectly fair, there is a 25% chance that it would generate 2 Tails (or 2
Heads) in a row. Since, after 2 flips, there is a combined 50% chance of seeing
2 of the same outcome (either "Heads, Heads" or "Tails, Tails"),
it would be reckless for the researcher to announce, "The coin is
unfair!" after only 2 observations.
To
safeguard against a Type I error, the researcher adopts the standard convention
of insisting on a "5% significance level." This means that the
researcher wants to announce that the coin is unfair only if there is a 5% or
smaller probability that in so doing, he has been fooled by a false positive.
How many flips will he
need in order to have any chance of rejecting the null at the 5% level of
significance? At least 6. Consider: A fair coin will come up Heads 6 times in a
row with a probability of (1/2)^6 = (1/64), which is approximately 1.6% of the
time. Likewise, a fair coin will generate 6 consecutive Tails about 1.6% of the
time. Consequently, the chance of seeing either 6 Heads or 6
Tails—assuming the coin, in reality, is fair—is only (2/64) or 3.125%. That is
lower than our 5% significance threshold. Therefore, if our researcher conducts
an experiment involving 6 flips and observes either 6 Heads or 6 Tails, he can
confidently announce to the world that the coin is not fair.
(Note that if the researcher flips the coin 6 times and sees only 5 Heads and 1
Tail or 5 Tails and 1 Head, in any order, then that wouldn't be
sufficient evidence of an unfair coin.)
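For readers who want to check this arithmetic, the short Python sketch below (an illustrative calculation using only the standard library) reproduces both figures: the 3.125% chance of six of the same side, which clears the 5% bar, and the roughly 22% chance of a 5-to-1 or more lopsided split, which does not.

```python
from math import comb

# Probability that a fair coin (P(Heads) = 0.5) shows exactly k Heads in n flips.
def prob_exactly(k, n, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 6
# 6 Heads or 6 Tails: 2 * (1/2)^6 = 2/64
p_all_same = prob_exactly(6, n) + prob_exactly(0, n)
# At least 5 of one side (includes the 6-0 outcomes): 14/64
p_lopsided_5 = sum(prob_exactly(k, n) for k in (0, 1, 5, 6))

print(f"P(6 of the same side)     = {p_all_same:.4f}")    # 0.0312 -> below 5%
print(f"P(at least 5 of one side) = {p_lopsided_5:.4f}")   # 0.2188 -> above 5%
```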
The Tradeoff Between Type I and Type II Errors
However,
even though his adoption of the 5% significance threshold protects our
researcher from Type I errors, the small sample size (of 6 flips) makes him
vulnerable to a Type II error. A Type II error occurs when a researcher fails
to reject the null hypothesis when it really is false; in this case,
the researcher falls prey to a "false negative."
To quantify
the probability of a Type II error, we need to specify exactly how the
null hypothesis is false. In our example, suppose that the coin really is unfair
and that it comes up Heads 3/4 of the time and Tails only 1/4 of the time. If
our researcher flips the coin only 6 times, what is the likelihood that he will
correctly announce to the world, "My research shows that this coin is
unfair"?
To review
our earlier computations, the convention of adopting a 5% threshold for
"significance" means that in this small sample, our researcher must
observe either 6 Heads or 6 Tails in order to rule out the null hypothesis of a
fair coin. So what is the probability of observing such sequences, if in
fact—by stipulation in this example—the coin really is unfair—that
is, it comes up Heads 3/4 of the time?
With this
particular unfair coin, the probability of its coming up Heads 6 times out of 6
flips is (3/4)^6, or a little less than 18%. (There is also a very small
probability of its coming up Tails 6 times in a row.) In other words, even
though we assumed that this coin is unfair, our researcher has only
an 18% chance of concluding this; there is a corresponding 82%
probability of a Type II error. Thus, we say that the power of
this experiment is only 18%—i.e., it is an underpowered study.
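The 18% figure can be reproduced in the same style as the sketch above: compute the probability that a coin with a true Heads probability of 3/4 produces one of the two outcomes (6 Heads or 6 Tails) that would count as significant.

```python
from math import comb

def prob_exactly(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p_heads = 6, 0.75   # stipulated "true" unfairness of the coin

# The 6-flip test rejects the null only on 6 Heads or 6 Tails.
power = prob_exactly(6, n, p_heads) + prob_exactly(0, n, p_heads)
print(f"Power of the 6-flip test: {power:.3f}")   # ~0.178, i.e. roughly 18%
```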
For a given sample size,
there is a tradeoff between Type I and Type II errors. Just as it is routine to
insist on a maximum 5% for the probability of a Type I error, there is a
convention that a study have a power of at least 80%, so that the probability
of a Type II error is held to 20% or lower. The only way for our researcher to
increase the power of his study (i.e., reduce the chance of a false
negative) without leaving himself more vulnerable to a false
positive is to increase the sample size—to flip the coin more times.
Underpowered Studies and the Problem of Bias
We can use
our contrived but intuitive example to illustrate one more feature: A sample
size of 6 flips means that there is a high probability that our researcher will
fail to detect an unfair coin. It also means that on the off chance that
he does detect it, he might inflate the actual magnitude of
the coin's unfairness.
To see
this, suppose that many dozens of researchers are all vying for grants from the
Anti-Coin League, an organization that funds efforts to discredit the public's
faith in these coins. Further suppose that we are still dealing with the case
in which the coin really is unfair and comes up Heads 3/4 of
the time. Now, if all of the researchers in this community conduct experiments
involving only 6 coin flips, some of them will eventually
observe 6 Heads in a row.
Note that
we are not questioning the integrity of the scientists
involved—they aren't cheating in any way. Each flips the coins 6 times and
accurately reports the outcome. Furthermore, the scientific journals likewise
have their standards and will publish only results that are "statistically
significant." Even though the Anti-Coin League is doling out the dough,
the journals will publish only the results of a researcher who observed 6 Heads
because even a result of 5 Heads and 1 Tail could be due to chance.
Yet what
happens when the researcher makes an estimate of just how unfair the
coin is? Since he observes 6 Heads and 0 Tails—the necessary outcome to be
"significant" and worthy of publication—the "sample mean"
of the probability of Heads is 100%, while the "sample mean"
probability of Tails is 0%. In other words, there will be an entire literature
consisting of papers finding statistically significant evidence that the coins
are unfair, in which the "best guess" is that the coins come up Heads
all the time and never come up Tails.
So we see—in this
particular example with a coin coming up Heads 3/4 of the time—that a sample
size of 6 flips would mean that a given test had a power of only 18% and that
the typical reported magnitude of the effect would be severely
inflated.
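The same point can be made by brute-force simulation. The sketch below imagines a hypothetical community of 100,000 researchers, each flipping the 3/4-Heads coin 6 times, and averages the estimates that survive the "publish only significant results" filter.

```python
import random

random.seed(0)                     # reproducible illustration
p_heads, n_flips = 0.75, 6
n_researchers = 100_000            # hypothetical community size

published = []
for _ in range(n_researchers):
    heads = sum(random.random() < p_heads for _ in range(n_flips))
    if heads in (0, n_flips):      # only 6-0 outcomes are "significant" and get published
        published.append(heads / n_flips)

print(f"Share of studies published:    {len(published) / n_researchers:.3f}")   # ~0.18
print(f"Mean published P(Heads) guess: {sum(published) / len(published):.3f}")  # ~1.00 vs. true 0.75
```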
Larger Sample Size Helps on Both Dimensions
If we want
to retain the "5% significance" threshold, then, as already noted,
the only way to increase the power of our study is to increase the sample size.
For example, suppose that instead of flipping the coin only 6 times, our
researcher flips it 10 times. What effect does this have?
First, we
need to recalculate how lopsided an outcome we would need to observe in order
to reject the null hypothesis ("this is a fair coin") with at least
95% confidence. It turns out we would still need to see either 9 or 10 of the
same side (either Heads or Tails) in order to confidently reject the
possibility that a fair coin generated such a sequence.
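The cutoff of 9 can be verified by checking how often a fair coin produces outcomes at least that lopsided:

```python
from math import comb

def prob_exactly(k, n, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
for m in (8, 9, 10):
    # Chance a fair coin shows at least m Heads or at least m Tails out of 10.
    p_lopsided = sum(prob_exactly(k, n) for k in range(m, n + 1)) \
               + sum(prob_exactly(k, n) for k in range(0, n - m + 1))
    print(f"P(at least {m} of one side) = {p_lopsided:.4f}")
# m = 8 gives ~0.109 (fails the 5% test); m = 9 gives ~0.021 (passes),
# so the researcher must see at least 9 of the same side.
```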
Now that we
know the observational threshold to reject the null, we can compute the power
of our researcher's new study, which relies on a sample size of 10. As before,
suppose that, in reality, the coin is unfair and comes up
Heads 3/4 of the time. With our larger sample size, the probability that such a
coin will generate at least 9 Heads is a bit more than 24%. That is, by
increasing the sample size from 6 to 10, we have boosted the power of our
study—if we maintain our assumption that a coin comes up Heads 3/4 of the time—from
18% to 24%.
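Again, a few lines suffice to reproduce the 24% figure, using the rejection rule of at least 9 of one side:

```python
from math import comb

def prob_exactly(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p_heads = 10, 0.75
# Reject the null on at least 9 Heads or at least 9 Tails.
power = sum(prob_exactly(k, n, p_heads) for k in (9, 10, 0, 1))
print(f"Power of the 10-flip test: {power:.3f}")   # ~0.244, a bit more than 24%
```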
Furthermore, if we
imagine a community of researchers running experiments using 10 flips in each
study, then, among the approximately 24% of them who find "statistically
significant" evidence that the coins are unfair, about 77% of this batch
of studies will report that the coins come up Heads 9 out of 10 flips, while
the remaining 23% will estimate that the coins always come up
Heads. Thus, these higher-powered studies are still very biased—they
overestimate how unfair the coins are—but they are closer to
the true value than the previous case, when 100% of the studies finding
statistical significance concluded that the coins always came up Heads.
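The 77/23 split among the published studies falls straight out of the binomial probabilities, conditioning on a significant result:

```python
from math import comb

def prob_exactly(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p_heads = 10, 0.75
p9, p10 = prob_exactly(9, n, p_heads), prob_exactly(10, n, p_heads)
significant = p9 + p10     # the all-Tails outcomes are negligible here

print(f"Share of significant studies reporting 9/10 Heads:  {p9 / significant:.2f}")   # ~0.77
print(f"Share of significant studies reporting 10/10 Heads: {p10 / significant:.2f}")  # ~0.23
```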
It's Harder to Detect a Weaker Signal
The final
observation to make regarding our hypothetical coin research is that our
measures of statistical power and bias depended on our assumption that the
coin, in fact, came up Heads 3/4 of the time. Suppose, instead, that the coins
were only a little unfair, coming up Heads only 51% of the
time.
With our
larger sample size of 10 flips, the threshold for ruling out the null
hypothesis is the same. However, now that the signal is weaker, it will be much
harder to detect the presence of an unfair coin. Specifically, if the coin, in
reality, comes up Heads 51% of the time, there is only a 2% chance that a
researcher in a given experiment would observe at least 9 Heads or Tails. Thus,
with such a weak signal, the power of our 10-flip experiment would drop to 2%.
And the problem of bias would be much larger because, among the rare
experiments that found statistical significance, most would vastly overstate
the coin's tendency to turn up Heads, but, perversely, 38% of this small batch
of "unfair coin" studies would estimate that the coin comes up Tails 9
out of 10 times—getting the magnitude and the sign wrong.
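The weak-signal case can be checked the same way; note that close to 40% of the "significant" results now land on the Tails-heavy side, which is the wrong sign.

```python
from math import comb

def prob_exactly(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p_heads = 10, 0.51
power = sum(prob_exactly(k, n, p_heads) for k in (9, 10, 0, 1))
print(f"Power against a 51% coin: {power:.3f}")   # ~0.022, roughly 2%

# Among the significant results, the share reporting exactly 9 Tails (wrong sign):
print(f"Share reporting 9 Tails: {prob_exactly(1, n, p_heads) / power:.2f}")  # ~0.38
```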
As our exaggerated
examples illustrate, we cannot measure the power of a study without prior
knowledge about the true value of the parameter being
estimated. (In practice, of course, we can't have this knowledge and so we must
always estimate the power of a study as well.) Other things
equal, the weaker the "true" effect—though different from zero—the
lower the power of a given experiment or study, and the more likely it is that
the researcher will end up exaggerating the actual magnitude of the effect
should he correctly reject the null.
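This dependence on the assumed true effect is easy to see by sweeping the coin's true Heads probability through a grid and recomputing the power of the same 10-flip test; the sketch below is, in effect, a miniature power analysis.

```python
from math import comb

def prob_exactly(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def power_10_flip_test(p_heads, n=10):
    # Reject the null of a fair coin on at least 9 Heads or at least 9 Tails.
    return sum(prob_exactly(k, n, p_heads) for k in (0, 1, n - 1, n))

for p in (0.51, 0.60, 0.75, 0.90):
    print(f"True P(Heads) = {p:.2f} -> power = {power_10_flip_test(p):.3f}")
# Output: ~0.022, ~0.048, ~0.244, ~0.736; the weaker the true effect, the lower the power.
```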
Back to Ioannidis
The above
analysis sets the stage for comprehending the importance of Ioannidis et al.'s
recent study. They relied on 159 "meta-analyses" of more than 6,700
empirical studies, which collectively provided more than 64,000 individual
estimates of economic parameters. By using various techniques that give more
weight to the more-reliable estimates within a literature to estimate the
"true" value of a parameter, Ioannidis et al. could then
retroactively calculate the statistical power of the studies in each area of
inquiry, to see which were "adequately" powered (meaning that they
had a power of at least 80%).
Their
findings were sobering. When the authors ranked these 159 different
"research areas" according to the percentage of their studies that
had adequate statistical power, the median outcome was 10.5%, and that finding
relies on the most generous of the techniques to estimate the "true
value" of an economic parameter. The authors write: "That is, half of
the areas of economics have approximately 10% or fewer of their estimates with
adequate power."
Furthermore, the
underpowered studies also implied very large biases in estimates of the
magnitude of economic parameters. For example, of 39 separate estimates of the
monetary value of a "statistical life"—a concept used in cost/benefit
analyses of regulations—29 (74%) of the estimates were underpowered. For the 10
studies that had adequate power, the estimate of the value of a statistical
life was $1.47 million, but the 39 studies collectively gave a mean estimate of
$9.5 million. After our hypothetical examples of coin-flipping researchers,
this real-world example leads one to suspect that the figure of $9.5 million is
likely to be vastly exaggerated.
Conclusion
As a recent paper by
Ioannidis et al. illustrates, economists should be more careful with their
statistics. At the very least, empirical economists should broaden their
horizons to realize that "statistical significance" by itself is not
enough; one must also consider the power of a study.
Currently, an alarming proportion of empirical investigations of economic
parameters are underpowered, meaning that the "significant" results
are quite possibly very biased. Listening to Russ Roberts' recent interview
with Ioannidis would help economists avoid pitfalls in future research.
Footnotes
The best
timeline I have found of the brewing "replication crisis" in
experimental psychology and sociology is provided by Andrew Gelman, professor
of statistics and director of the Applied Statistics Center at Columbia
University. See Gelman's post, "What has happened down
here is the winds have changed," AndrewGelman.com, September
21, 2016.
For a
description of a major analysis that couldn't replicate even half of 100
psychology findings published in leading journals, see Benedict Carey, "Many Psychology Findings
Not as Strong as Claimed, Study Says," New York
Times, August 27, 2015. As the article explains, "The new analysis,
called the Reproducibility Project, found no evidence of fraud or that any
original study was definitively false. Rather, it concluded that the evidence
for most published findings was not nearly as strong as originally
claimed."
Ioannidis'
full bio is available at https://med.stanford.edu/profiles/john-ioannidis. The January 20, 2018
episode of EconTalk is available at John Ioannidis on Statistical
Significance, Economics, and Replication.
John P. A. Ioannidis, T. D. Stanley, and Hristos Doucouliagos, "The Power of Bias in
Economics Research," The Economic Journal, Vol. 127, Issue 605, October 2017.
Specifically,
the probability of seeing exactly 5 Heads out of 6 total flips is about 0.0938,
and of seeing exactly 6 Heads is 0.0156. So the probability of seeing at least
5 Heads or 5 Tails is about 0.2188, well above the 5%
significance cutoff. In other words, our researcher couldn't confidently reject
the hypothesis that the coin is fair if he sees a 5-to-1 lopsided outcome with
6 total flips.
It's even
conceivable that a researcher observes 6 Tails in a row, but this outcome is
expected to happen less often than once in every 4,000 trials.
Specifically, the
borderline sequence is when the coin comes up with one side (let's say Heads)
exactly 2 out of 10 times. The probability of this happening with a fair coin
is (1/2)^10 * (10*9)/2 ≈ 0.0439. Therefore the probability of seeing either exactly
2 Heads or exactly 2 Tails is approximately 0.0879, which is
higher than the 5% threshold. In contrast, the probability of seeing either exactly
1 Head or exactly 1 Tail is only 0.0195, meaning that this
sequence would lead the researcher to confidently reject the null hypothesis
that this is a fair coin. (We also need to include the probability of seeing
all 10 Heads or all 10 Tails, because the rule is "reject the null if we
see at least 9 Heads or Tails," but these outcomes are
very rare and don't tip our rule above 5% total probability with a fair coin.)
*I thank Kevin Grier and Alan Murphy for helpful
comments on an initial draft.
Robert P. Murphy is Research Assistant Professor with the Free Market Institute at Texas Tech University. He is the author of Choice: Cooperation, Enterprise, and Human Action (Independent Institute, 2015).