Although the vast majority of published fMRI studies are carefully conducted, some contain serious statistical or design flaws that raise doubts about their validity. At times it seems that these "junk" papers are the very ones most likely to be picked up by the popular press! Because methodological errors can be difficult to recognize, I believe it is wise to view all fMRI studies with a critical eye toward their design, analysis, and conclusions. A few of the more common errors and pitfalls are described below.
Statistical Errors Leading to High False Positive Rates
All fMRI studies must strike a delicate balance between excluding false positives (declaring an area active when it is not) and accepting false negatives (considering an area silent when it really does activate). The more strictly you protect against one type of error, the more of the other you will incur.
The false positive rate is typically controlled at the voxel level by selection of an arbitrary p-value threshold. In non-fMRI experiments p = 0.05 (i.e., accepting a 5% chance of a false positive per test) is often considered a reasonable choice. However, such a modest p-value proves grossly inadequate for fMRI studies, where 100,000 or more voxels must be tested simultaneously. If the p-value were set to 0.05 for each voxel, then as many as 5,000 of the 100,000 total voxels in a study (100,000 × 0.05) would appear falsely activated. This issue is known in statistics as the multiple comparisons problem and is amenable to several correction strategies (such as Bonferroni correction, false discovery rate control, and random field theory) that reduce the effective per-voxel p-value to roughly 0.001 or below.
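This arithmetic is easy to verify by simulation. The sketch below (Python with NumPy; the voxel count and thresholds are illustrative assumptions, not values from any particular study) draws p-values for a "brain" containing no true activation and counts how many voxels cross an uncorrected threshold versus a Bonferroni-corrected one:

```python
import numpy as np

rng = np.random.default_rng(0)

n_voxels = 100_000   # assumed whole-brain voxel count for illustration
alpha = 0.05         # conventional uncorrected per-voxel threshold

# Null "brain": no voxel truly activates, so each p-value is uniform on [0, 1].
p_values = rng.uniform(size=n_voxels)

uncorrected = np.sum(p_values < alpha)
bonferroni = np.sum(p_values < alpha / n_voxels)  # Bonferroni-corrected threshold

print(f"False positives at uncorrected p < 0.05: {uncorrected}")  # ~5000 voxels
print(f"False positives after Bonferroni:        {bonferroni}")   # almost always 0
```

Bonferroni is the simplest (and most conservative) correction; false discovery rate and random field methods recover more sensitivity, as reviewed in the Nichols references below.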
Surprisingly, many early fMRI papers did not make appropriate statistical corrections for multiple comparisons. This issue was brought to light by Craig Bennett and colleagues in their famous fMRI study of a dead Atlantic salmon "viewing" photos of human faces: false positive activation was detected in the salmon's brain when corrections for multiple comparisons were not properly performed.
Bennett and his salmon sufficiently shamed the neuroimaging community that multiple comparison corrections are now nearly universally employed in fMRI analysis. But just as this statistical error was put to rest, another surfaced. In 2016 Eklund et al. reported that several commonly used methods of cluster analysis significantly inflated false positive rates. One of the most popular algorithms (AFNI's 3dClustSim) was even discovered to contain an unrecognized 15-year-old software "bug" (now corrected). The implications are staggering -- improper cluster analysis may affect the conclusions of up to 40,000 fMRI studies published between 2000 and 2015.
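Eklund et al. traced much of the problem to parametric cluster-size thresholds built on inaccurate assumptions about the spatial smoothness of fMRI noise. The toy one-dimensional simulation below (Python with NumPy; the map size, smoothing width, and threshold are assumptions for illustration only) shows how strongly the null distribution of cluster sizes depends on smoothness -- if the assumed smoothness is wrong, the cluster-extent threshold will be wrong too:

```python
import numpy as np

rng = np.random.default_rng(4)

def max_cluster(x: np.ndarray, z_thresh: float = 2.3) -> int:
    """Size of the largest run of suprathreshold voxels in a 1D map."""
    above = np.concatenate(([0], (x > z_thresh).astype(int), [0]))
    edges = np.flatnonzero(np.diff(above))  # +1 at run starts, -1 at run ends
    if edges.size == 0:
        return 0
    starts, ends = edges[::2], edges[1::2]
    return int((ends - starts).max())

n_voxels, n_maps = 10_000, 500
raw = rng.normal(size=(n_maps, n_voxels))  # pure noise: nothing truly activates

# Smooth each null map with a simple moving average, then re-standardize.
width = 15
kernel = np.ones(width) / width
smooth = np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 1, raw)
smooth /= smooth.std(axis=1, keepdims=True)

print("Mean largest null cluster, unsmoothed:", np.mean([max_cluster(m) for m in raw]))
print("Mean largest null cluster, smoothed:  ", np.mean([max_cluster(m) for m in smooth]))
```

Smoothed noise routinely produces clusters many voxels wide; a cluster-size cutoff calibrated for the wrong smoothness will therefore pass spurious clusters as "significant."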
Test-Retest Unreliability
In 2020 two independent groups of investigators published major re-assessments of task-based fMRI, with sobering results. In the first paper, the same fMRI dataset was given to 70 experienced research teams worldwide to test the same hypotheses, and the teams produced astonishingly different results depending on their chosen analysis pipelines. The second paper, a meta-analysis, found that the test-retest reliability of common task-based fMRI measures was very poor, far below recommended cutoffs for clinical application or individual comparisons; it led its senior author, a Duke University researcher, to question 15 years of his own work. One commentator wrote, "The dead salmon has lots of company".
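Test-retest reliability in this literature is commonly quantified with the intraclass correlation coefficient (ICC). The sketch below (Python with NumPy; the cohort and all numbers are simulated assumptions for illustration) implements the standard two-way random-effects, single-measure form, ICC(2,1), and applies it to synthetic data whose stable between-subject signal is small relative to session-to-session noise:

```python
import numpy as np

def icc_2_1(data: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC(2,1).

    data: (n_subjects, n_sessions) array of activation estimates,
    e.g., mean beta values from a region of interest for each scan session.
    """
    n, k = data.shape
    grand = data.mean()
    subj = data.mean(axis=1)   # per-subject means
    sess = data.mean(axis=0)   # per-session means

    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((subj - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((sess - grand) ** 2) / (k - 1)
    resid = data - subj[:, None] - sess[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical cohort: 40 subjects scanned twice, with a weak stable trait
# signal buried in session noise of twice its amplitude (true ICC = 1/(1+4) = 0.2).
rng = np.random.default_rng(1)
trait = rng.normal(size=40)
data = np.column_stack([trait + rng.normal(scale=2.0, size=40) for _ in range(2)])
print(f"ICC(2,1) = {icc_2_1(data):.2f}")  # ~0.2, far below often-cited 0.7-0.8 cutoffs
```

Values in this range are consistent with the poor reliabilities reported in these papers; commonly cited cutoffs for clinical or individual-difference use are roughly 0.7 to 0.8.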
Circularity Errors ("Voodoo Correlations")
In 2009 Vul et al. pointed out that more than 50 high-profile publications linking fMRI activation with measures of emotion, personality, and social cognition reported unbelievably high ("voodoo") correlation values. A careful analysis of experimental methods provided the explanation: many investigators first preselected the voxels that appeared most strongly correlated with the behavioral measure and then performed statistical analysis on those voxels only. Because the reported voxels were not selected independently of the data being tested, the calculated correlation values were falsely elevated. This methodological error is a form of circular reasoning that can still be encountered in fMRI studies. Voxels or regions of interest should always be selected before (not after) the experiment has been conducted, or defined using independent data.
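The inflation is easy to reproduce. In the simulation below (Python with NumPy; the subject count, voxel count, and "behavioral" score are all synthetic assumptions), every voxel is pure noise, yet circularly preselecting the voxels most correlated with behavior yields impressively high correlations:

```python
import numpy as np

rng = np.random.default_rng(2)

n_subjects, n_voxels = 20, 50_000
behavior = rng.normal(size=n_subjects)            # e.g., a personality score
voxels = rng.normal(size=(n_subjects, n_voxels))  # pure noise: no true effect

# Pearson correlation of every voxel with the behavioral measure
vx = voxels - voxels.mean(axis=0)
bx = behavior - behavior.mean()
r = (vx * bx[:, None]).sum(axis=0) / np.sqrt(
    (vx ** 2).sum(axis=0) * (bx ** 2).sum()
)

# Circular step: keep only the 10 "best" voxels, then report their correlation
top = np.argsort(np.abs(r))[-10:]
print(f"Mean |r| of preselected voxels: {np.abs(r[top]).mean():.2f}")  # ~0.75!
```

With only 20 subjects and tens of thousands of voxels, some noise voxels will correlate strongly with anything by chance; reporting the correlation of voxels selected for that very property is the circularity Vul et al. exposed.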
Other fMRI Pitfalls and Fallacies
Below are a few additional potential errors, pitfalls, and reminders to consider before blindly accepting the conclusions of an fMRI experiment:
- Voxels that fail to reach the statistical threshold for significance are not necessarily inactive, as most fMRI studies are only weakly powered and carry high false negative rates (see the simulation sketched after this list).
- Statistical significance does not imply causation.
- Analysis of pooled data from multiple subjects is filled with potential statistical pitfalls and suffers from imprecise anatomical localization.
- Beware of studies using either very high or very low statistical thresholds. Very high (strict) thresholds tend to collapse activation into just a few dominant regions and favor sensorimotor paradigms (which have intrinsically stronger responses than cognitive or emotive ones). Very low thresholds admit far too many false positives, which may go unrecognized if techniques such as white matter masking are employed.
- Reverse inference experiments are treacherous to interpret: reasoning backward from activation in a region to a specific mental state is unreliable because some areas of the brain are associated with diverse and even conflicting states such as pain, fear, love, worry, and anticipation.
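On the first point in the list above, underpowering is easy to quantify (this is the simulation referenced there). The sketch below (Python with NumPy and SciPy; the sample size, effect size, and threshold are illustrative assumptions) tests a truly active voxel in many small-sample studies at a typical corrected fMRI threshold and reports how often the activation is missed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n_subjects = 16        # a common (small) fMRI group size
true_effect = 0.5      # modest standardized effect size (Cohen's d)
n_simulations = 10_000

# Many repetitions of a one-sample t-test on a voxel that truly activates
samples = rng.normal(loc=true_effect, scale=1.0, size=(n_simulations, n_subjects))
_, p = stats.ttest_1samp(samples, popmean=0.0, axis=1)

power = np.mean(p < 0.001)  # a typical corrected per-voxel threshold
print(f"Detection rate: {power:.0%}; false negative rate: {1 - power:.0%}")
```

Under these assumptions the truly active voxel is detected in well under 10% of studies, so its absence from a thresholded activation map says very little.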
References
Bennett CM, Baird AA, Miller MB, Wolford GL. Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results 2010; 1:1-5 (jsur.org).
Bennett CM, Wolford GL, Miller MB. The principled control of false positives in neuroimaging. Soc Cogn Affect Neurosci 2009; 4:417-422.
Botvinik-Nezer R, Holzmeister F, Camerer CR, et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 2020; 582:84-88. (Same fMRI dataset processed by 70 different groups, with diverse results).
Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci 2014; 1:140216.
Eklund A, Nichols TE, Knutsson H. Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Proc Nat Acad Sci USA 2016; 113:7900-7905. (Important paper critiquing methods of cluster analysis)
Elliott ML, Knodt AR, Ireland D, et al. What is the test-retest reliability of common task-functional MRI measures? New empirical evidence and a meta-analysis. Psychological Science 2020:1-15; DOI: 10.1177/0956797620916786. (Duke study referenced in the text).
Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol 2008; 45:135-140.
Nichols TE. Multiple testing corrections, nonparametric methods, and random field theory. NeuroImage 2012; 62:811-815.
Nichols T, Hayasaka S. Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Methods Med Res 2003; 12:419-446.
Vul E, Harris C, Winkielman P, Pashler H. Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspect Psychol Sci 2009; 4:274-290. (Paper was originally known as "Voodoo Correlations in Social Neuroscience")
Related Questions
How are those activation "blobs" on an fMRI image created, and what exactly do they represent?
We often see areas of fMRI activation outside the brain itself. Why do these occur and what can be done to correct them?