Systematic Misestimation of Machine Learning Performance in Neuroimaging Studies of Depression

Claas Flint; Micah Cearns; Nils Opel; Ronny Redlich; David M. A. Mehler; Daniel Emden; Nils R. Winter; Ramona Leenings; Simon B. Eickhoff; Tilo Kircher; Axel Krug; Igor Nenadic; Volker Arolt; Scott Clark; Bernhard T. Baune; Xiaoyi Jiang; Udo Dannlowski; Tim Hahn

arXiv:1912.06686·q-bio.NC·June 23, 2021

TL;DR

This study reveals that small sample sizes in neuroimaging machine learning studies of depression often lead to overestimated performance metrics, highlighting the importance of larger test sets for valid results.

Contribution

It systematically demonstrates the risk of performance misestimation in small samples and emphasizes the need for larger test sets to improve reliability in neuroimaging ML studies.

Findings

01 Small samples can produce inflated accuracy estimates.

02 Large test sets mitigate performance overestimation.

03 Current literature may overstate ML performance due to small sample sizes.

Abstract

We currently observe a disconcerting phenomenon in machine learning studies in psychiatry: While we would expect larger samples to yield better results due to the availability of more data, larger machine learning studies consistently show much weaker performance than the numerous small-scale studies. Here, we systematically investigated this effect focusing on one of the most heavily studied questions in the field, namely the classification of patients suffering from major depressive disorder (MDD) and healthy controls (HC) based on neuroimaging data. Drawing upon structural magnetic resonance imaging (MRI) data from a balanced sample of $N = 1{,}868$ MDD patients and HC from our recent international Predictive Analytics Competition (PAC), we first trained and tested a classification model on the full dataset, which yielded an accuracy of $61\%$. Next, we mimicked the process by which…
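
The link between test-set size and inflated accuracy can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes a hypothetical classifier whose true accuracy is fixed at 0.61 (borrowed from the full-dataset figure in the abstract) and treats each test prediction as an independent coin flip. The function names, the test-set sizes of 20 and 100, and the number of simulated studies are illustrative assumptions. It shows how widely the observed accuracy scatters when the test set is small, so that the best of many small studies can look far better than the true value.

```python
import math
import random


def accuracy_standard_error(true_accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimate from n_test samples."""
    return math.sqrt(true_accuracy * (1.0 - true_accuracy) / n_test)


def simulate_observed_accuracies(true_accuracy: float, n_test: int,
                                 n_studies: int, seed: int = 0) -> list:
    """Simulate the accuracy observed in n_studies independent test sets.

    Assumption: each prediction is correct with probability true_accuracy,
    independently of all others (a simplification, not the paper's setup).
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n_studies):
        n_correct = sum(rng.random() < true_accuracy for _ in range(n_test))
        observed.append(n_correct / n_test)
    return observed


if __name__ == "__main__":
    TRUE_ACCURACY = 0.61  # assumed true accuracy (hypothetical)
    N_STUDIES = 200       # assumed number of simulated studies

    for n_test in (20, 100, 1868):
        accs = simulate_observed_accuracies(TRUE_ACCURACY, n_test, N_STUDIES)
        se = accuracy_standard_error(TRUE_ACCURACY, n_test)
        print(f"n_test={n_test:5d}  standard error={se:.3f}  "
              f"min={min(accs):.2f}  max={max(accs):.2f}")
```

Under these assumptions the standard error shrinks with the square root of the test-set size, so reporting only the best-performing small studies would overstate performance even though the underlying classifier never changes.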

Code & Models

Repositories

cl445/misestimation_mri_mdd
