On occasion, it is important to step out of your comfort zone and look critically at accepted dogma. That is the purpose of this editorial. Although not dealing directly with our subspecialty, statistics play an important role in how we interpret the literature. There is an ongoing debate among statisticians as to the best metrics and methods for assessing clinical studies. As evidenced here, P values are not the ultimate test of scientific legitimacy, but rather only one (controversial, imperfect) recognized tool for evaluating experimental results.

Value of the P Value

Jack Parker, MD, Carrie Huisingh, MPH, Gerald McGwin Jr, MS, PhD

In modern medical research, P values have become, in the minds of many, the most important indicator of the truth of a scientific proposition and the key instrument differentiating "real" effects from those due to "random chance." Consequently, it is no exaggeration to say that the landscape of medical research has been decisively shaped by a pervasive belief in the power of P values. They have come to determine not only which studies are published but also which projects are funded and which ideas are pursued. It is, therefore, shocking to many physicians to learn that a legitimate suspicion exists among statisticians regarding the real usefulness of P values, particularly as a stand-alone measure of validity.

This skepticism has its roots in, of all places, the Guinness brewery, the birthplace of P values and the concept of statistical significance. In 1908, William Sealy Gosset, mathematician and head experimental brewer for Guinness in Dublin, published an article on "statistical significance," in which he endeavored to explain how to determine which inputs in the brewing process made the greatest difference in the quality of the drink. Notably, in this article, Gosset wrote, "The important thing is to have a low real error, not to have a 'significant' result at a particular station. The latter seems to me to be nearly valueless in itself" (1). Another statistician of the time, Ronald Fisher, is often credited with establishing the ubiquitous use of 5% as the benchmark for scientific legitimacy, although this convention often rests on a misunderstanding of Gosset's work. Debates between Fisher and 2 other statisticians of the time, Jerzy Neyman and Egon Pearson, perhaps best illustrate the controversy regarding the P value: is it an absolute measure to be interpreted in context, or simply a value above or below a prespecified benchmark, typically 5%?

These debates regarding the use of P values have been inherited by subsequent generations of statisticians; yet, as with many traits that are passed from one generation to the next and become muted with time, P values are frequently misinterpreted and vastly misunderstood (2-7). When asked to explain a P value, a not uncommon response is "It is the probability that the null hypothesis is true." This is incorrect. In fact, it is the exact opposite of the truth, because the operating principle of the P value is the assumption that the null hypothesis is correct, that is, that there is no real effect of a given intervention. The information provided by the P value is therefore the probability of the data, given the assumption that the intervention is ineffectual. As a result, the P value provides no information about (and makes no attempt to inform about) the probability of the null hypothesis.
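To make the distinction concrete, consider a short simulation. The sketch below is a minimal illustration in Python; the two-sample t-test, group sizes, and number of trials are assumptions chosen for demonstration, not anything prescribed in this editorial.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Simulate 10,000 two-group experiments in which the null
    # hypothesis is TRUE by construction: both groups are drawn
    # from the same distribution, so no real effect exists.
    p_values = []
    for _ in range(10_000):
        a = rng.normal(loc=0.0, scale=1.0, size=30)
        b = rng.normal(loc=0.0, scale=1.0, size=30)
        p_values.append(stats.ttest_ind(a, b).pvalue)
    p_values = np.array(p_values)

    # When the null is true, P values are uniformly distributed:
    # about 5% fall below 0.05 and about 1% below 0.01.
    print(f"P < 0.05 in {np.mean(p_values < 0.05):.1%} of null experiments")
    print(f"P < 0.01 in {np.mean(p_values < 0.01):.1%} of null experiments")

Because every simulated experiment is generated with no real effect, the exercise shows what a P value does and does not measure: a P value of 0.03 says that data this extreme would be unusual if the null hypothesis were true; it does not say there is a 3% chance that the null hypothesis is true.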
Another misconception is that a P value <0.05 is "statistically significant" and therefore must be clinically important. This is not correct, for several reasons. First, the difference may be too small to be clinically meaningful. The P value carries no information about the magnitude of the effect or the precision of the estimate, which are captured by the point estimate and the confidence interval. A very small P value, such as <0.01, does not necessarily mean a strong association (8,9). The strength of the association comes from the effect size, a measure of the strength of the relationship between 2 variables (e.g., the odds ratio, relative risk, or correlation coefficient) (10). Second, the end point itself may not be clinically important (e.g., when surrogate outcomes are used). Third, it is possible to achieve a P value <0.05 simply by increasing the sample size.

A far greater problem arises from the misinterpretation of nonsignificant findings. A P value >0.05 is often called "nonsignificant." That term wrongly implies that the study has shown there is no difference between groups and that a nonsignificant P value is good evidence of a true null hypothesis (4). A nonsignificant result does not mean that the treatment is not beneficial, only that the possibility of chance producing a difference of the observed size is too large to demonstrate the "significance" of the treatment effect. Although it is usually reasonable not to accept a new treatment unless there is positive evidence in its favor, when issues of public health are concerned, the absence of evidence is not always valid justification for inaction (4). Rather, other evidence is needed to appropriately accept the null hypothesis as true; the magnitude of the association and its confidence interval can help researchers make an informed interpretation (11). A further problem is that this terminology perpetuates the idea that results must fall on one side or the other of a demarcation, as if the study conclusively proved whether a certain phenomenon existed, when, in reality, one of the established tenets of modern biomedical statistics is that results are not simply statistically significant or not (10,12,13). Both the effect-size and the sample-size points are illustrated in the sketches below.
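The effect-size point can be shown with a hypothetical illustration: given enough observations, even a trivially weak association attains a minuscule P value. The sketch below is a minimal example; the correlation of roughly 0.03 and the sample size are invented for demonstration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # A genuinely weak association: y depends only slightly on x.
    n = 100_000
    x = rng.normal(size=n)
    y = 0.03 * x + rng.normal(size=n)

    r, p = stats.pearsonr(x, y)

    # The P value is tiny, yet x explains well under 1% of the
    # variance in y; the effect size, not the P value, conveys
    # how weak the association actually is.
    print(f"r = {r:.3f}   r^2 = {r**2:.5f}   P = {p:.2e}")

Here the effect size (r of roughly 0.03, under 0.1% of variance explained) reveals the weakness of a relationship that the P value alone conceals.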
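The sample-size point lends itself to the same treatment. In this sketch (again with assumed, illustrative numbers), the true effect is held fixed at one tenth of a standard deviation while the sample grows; the P value drops below 0.05 on sample size alone, even though the effect never changes.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Hold a small true effect fixed (a 0.1 standard-deviation
    # difference in group means) and enlarge the sample: the P value
    # shrinks while the estimated difference stays equally small.
    for n in (20, 200, 2_000, 20_000):
        a = rng.normal(loc=0.0, scale=1.0, size=n)
        b = rng.normal(loc=0.1, scale=1.0, size=n)
        p = stats.ttest_ind(a, b).pvalue
        diff = b.mean() - a.mean()
        print(f"n per group = {n:>6}   diff = {diff:+.3f}   P = {p:.2g}")

The point estimate hovers near 0.1 standard deviations throughout; only the P value moves, which is why a P value by itself cannot speak to clinical importance.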
Having described what a P value is not, the question remains: what is a P value? A P value is the probability of obtaining a result at least as extreme as the observed result when the null hypothesis is true (14). It is a continuous measure ranging from zero to one (uniformly distributed when the null hypothesis is true); conventionally, however, it is dichotomized at 0.05. If the P value is below 0.05, the null hypothesis is rejected and the observed results are called "significant."

A P value <0.05 is an arbitrary cut-point for a statistic that captures only one of many possible sources of error, so the correct evaluation of a P value is fundamentally a qualitative process. Moreover, it is but one of several pieces of information that should be used to interpret the results of scientific research, alongside the magnitude of the effect and its associated, typically 95%, confidence interval. Taken together, these pieces of information provide the researcher, and perhaps more importantly the scientific community and the public, a more robust perspective on a study's results. They allow us to appropriately dismiss statistically significant results associated with clinically meaningless effect sizes while advancing nonsignificant results wherein the effect size was large. The value of P values is not their ability to serve as an isolated, easily understood scientific seal of approval; rather, they are but one piece of scientific evidence that, when properly applied and combined with all other available evidence, can be used to begin a conversation regarding the proper interpretation of a study's results.

REFERENCES

1. Ziliak S, McCloskey D. The Cult of Statistical Significance. Ann Arbor, MI: The University of Michigan Press; 2008.
2. Berkson J. Tests of significance considered as evidence. J Am Stat Assoc. 1942;37:325-335.
3. Rothman KJ. Significance questing. Ann Intern Med. 1986;105:445-447.
4. Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995;311:485.
5. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998;51:355-360.
6. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. 1999;130:995-1004.
7. Pharoah P. How not to interpret a P value? J Natl Cancer Inst. 2007;99:332-333.
8. Fisher R. Statistical Methods for Research Workers. Edinburgh, Scotland: Oliver and Boyd; 1950.
9. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc. 2007;82:591-605.
10. Rothman KJ, Greenland S. Modern Epidemiology. 2nd edition. Philadelphia, PA: Lippincott Williams & Wilkins; 1998.
11. Poole C. Low P-values or narrow confidence intervals: which are more durable? Epidemiology. 2001;12:291-294.
12. Poole C. Beyond the confidence interval. Am J Public Health. 1987;77:195-199.
13. Rothman KJ, Lanes S, Robins J. Causal inference. Epidemiology. 1993;4:555-556.
14. Schervish MJ. P values: what they are and what they are not. Am Stat. 1996;50:203-206.