Medical Uncertainties and P-values
Abhaya Indrayan
The editors of Basic and Applied Social Psychology created a furore in 2015 by banning the use of P-values by authors submitting to the journal. P-values have long been under attack, but this ban started a fresh debate on their relevance in empirical decisions. The American Statistical Association (ASA) seems to have buckled and issued a statement in March 2016 suggesting that P-values by themselves are of little value in inferences. ASA’s statement attracted widespread media coverage, and reports of non-replicability of many statistically validated research findings, particularly in medical and health sciences, lent credence to these allegations. The basic problem pointed out by critics is that P-values provide evidence against the null without telling us what they support.
Perhaps nobody will disagree that management of omnipresent medical uncertainties requires a powerful tool that can quantify them and help us control their impact on our decisions. In my opinion, statistical methods in general are an invaluable help in turning data into insight, and P-values in particular serve their purpose very well in data-based testing of hypotheses when used in the manner they are supposed to be used. Most of what I am trying to articulate in this note is well known in statistical analysis, but my attempt is to highlight the specific effects on P-values because of which these values become different, sometimes vastly different, from what is reported.
There is an unnecessary controversy over what P-values stand for. They are surely meant for falsifying the null on the basis of the available data rather than for validating it. Isn’t this the usual way of reasoning in many situations? Court decisions are based on evidence against innocence and rarely in favour of innocence. Only the evidence placed before the court is considered, and other evidence lying in the clouds, if any, is not a consideration. Can this judicial system be discredited on that account? Then why P-values?
At the root of non-replicable research findings are not P-values but their misuse. Most empirical research findings are based on a conventional cut-point of 0.05 without consideration of the P-values used for other variables in the same investigation. We all know that such ‘multiple comparisons’ can have an enormous deleterious effect: with many tests each run at the 0.05 level, the chance of at least one false-positive grows rapidly. In addition, many investigations build on previous results, which are themselves subject to such Type-I error. No adjustment for this double counting is made, and the actual probability of Type-I error often rises far beyond the threshold without our realizing that this has happened. Unsurprisingly, the results fail to replicate.
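As a rough illustration of this inflation (my own sketch, not part of the original editorial), the family-wise chance of at least one false-positive among k independent tests, each conducted at the 0.05 level, is 1 − 0.95^k:

```python
# Sketch only: family-wise Type-I error for k independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20):
    familywise = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: P(at least one false positive) = {familywise:.2f}")
```

Even ten nominally significant comparisons carry roughly a 40 per cent chance of at least one spurious finding, which is why the 0.05 label attached to any single result in a multi-variable investigation can be misleading.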
P-values are meant to quantify the sampling error and nothing else, but they are wrongly made to account for many other ‘deficiencies’ in the investigation (I am a culprit myself!). They are meant for random samples, mostly simple random samples, yet many studies use convenience sampling and come up with statistically significant results that have limited applicability, if any. P-values are also mostly based on a specified distribution such as the Gaussian, which in many cases is simply assumed without realizing the repercussions on the results. In both these instances, the actual P-values are not what are reported, and minor violations can have a butterfly effect. The most serious problem, however, is what I call epistemic uncertainty. All investigations are based on our existing knowledge, for example of the risk factors of an outcome, but our knowledge is far too inadequate in most scientific endeavours, and this inadequacy contributes to chance. How that affects P-values is seldom discussed.
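To make concrete how the actual P-values can differ from what is reported, here is a small simulation of my own (not from the editorial): a pooled-variance two-sample t-test applied, against its assumptions, to groups of unequal size and unequal spread. The nominal 0.05 level is not what the test actually delivers.

```python
import numpy as np
from scipy import stats

# Sketch only: nominal vs simulated Type-I error when the pooled-variance
# t-test's equal-variance assumption is knowingly violated.
rng = np.random.default_rng(0)
n_sim, alpha = 20_000, 0.05
rejections = 0
for _ in range(n_sim):
    x = rng.normal(0.0, 4.0, size=10)   # small group, large spread
    y = rng.normal(0.0, 1.0, size=40)   # large group, small spread; same true mean
    _, p = stats.ttest_ind(x, y, equal_var=True)  # pooled test assumes equal variances
    rejections += p < alpha
print(f"Nominal level: {alpha}, simulated Type-I error: {rejections / n_sim:.3f}")
```

Under these conditions the simulated rejection rate of a true null typically comes out several times the nominal 0.05, so the reported P-values understate the real sampling error.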
Medical literature is full of warnings that P-values by themselves should not be used for decisions unless they are supported by biological processes. While this is certainly sane advice and should be heeded, unexpected significant findings that have no biological justification at present need not be ignored. If the same finding is repeatedly seen in different settings, it seems prudent to believe it in the hope that a biological justification may emerge later on. This happens time and again with newly emerging diseases, where a particular sign-and-symptom syndrome is observed to occur more often than expected by chance and the causative agent is identified later. Thus it is not correct to say that P-values by themselves are of little value; they can be of enormous value in some cases.
Bias, whether due to unaccounted-for confounders or to intentional or unintentional prejudice in the collection, recording, analysis and interpretation of data, is another source that tends to make P-values unrealistic. Perhaps nothing can be done to alleviate this problem, and it irreparably afflicts the decisions, a damage that is then unfairly ascribed to P-values. Sometimes the effect is not properly adjusted even for known confounders because of the intricacies involved in the computation, and a simplistic analysis is done instead that fails to provide correct P-values. Missing values and errors in eliciting and recording information often go unnoticed even in the best of setups. The instruments generally used for obtaining the data may not be sufficiently equipped to provide valid measurements. The care required to avoid such errors may not have been exercised in an investigation, and the burden is then unfairly placed on our dear P-values.
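As one hedged illustration of the confounding point (my own construction, not the author's), a crude analysis that ignores a confounder can produce an impressively small P-value for an exposure with no direct effect at all, while an analysis adjusted for the confounder does not:

```python
import numpy as np
from scipy import stats

# Sketch only: a confounder Z drives both the 'exposure' X and the outcome Y,
# while X has no direct effect on Y. The crude P-value looks convincing; the
# adjusted one (residualising both on Z) does not.
rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=n)                   # confounder
x = z + rng.normal(scale=0.5, size=n)    # exposure driven by Z
y = z + rng.normal(scale=0.5, size=n)    # outcome driven by Z only

r_crude, p_crude = stats.pearsonr(x, y)

# Adjusted: correlate the parts of X and Y not explained by Z
x_res = x - np.polyval(np.polyfit(z, x, 1), z)
y_res = y - np.polyval(np.polyfit(z, y, 1), z)
r_adj, p_adj = stats.pearsonr(x_res, y_res)

print(f"Crude:    r = {r_crude:.2f}, P = {p_crude:.1e}")
print(f"Adjusted: r = {r_adj:.2f}, P = {p_adj:.2f}")
```

The crude P-value here reflects the confounder, not the exposure; blaming the P-value for the resulting wrong decision misplaces the fault, which lies with the simplistic analysis.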
Until a credible alternative emerges, we must continue to use P-values, albeit with improved data and better methods, and with full consideration of multiple testing and epistemic uncertainties.