I Can't Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification
Daniel Nobrega Medeiros

TL;DR
This study systematically shows that test-time augmentation (TTA) often degrades accuracy in medical image classification, contrary to a common assumption, and argues that TTA should be validated before use.
Contribution
The paper provides the first comprehensive empirical analysis showing that TTA can harm performance in medical imaging, and it highlights the roles of augmentation strategy and distribution shift.
Findings
TTA with standard augmentation pipelines degrades accuracy in most cases.
Augmentation intensity and whether the original image is included among the views both influence performance.
Distribution shift caused by augmentation and batch normalization mismatch is the main factor.
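Since the findings imply that TTA should be checked on held-out data before deployment, the comparison can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the helper names `accuracy`, `tta_gain`, and `should_use_tta`, and the `min_gain` threshold, are hypothetical, and the predictions are assumed to be class labels already computed on a validation set with and without TTA.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def tta_gain(single_preds: Sequence[int],
             tta_preds: Sequence[int],
             labels: Sequence[int]) -> float:
    """Accuracy with TTA minus single-pass accuracy (negative means TTA hurts)."""
    return accuracy(tta_preds, labels) - accuracy(single_preds, labels)


def should_use_tta(single_preds: Sequence[int],
                   tta_preds: Sequence[int],
                   labels: Sequence[int],
                   min_gain: float = 0.0) -> bool:
    """Enable TTA only if it beats single-pass inference by more than min_gain.

    min_gain is an assumed, user-chosen margin; it is not taken from the paper.
    """
    return tta_gain(single_preds, tta_preds, labels) > min_gain


# Toy validation set (made-up labels, for illustration only).
labels = [0, 1, 1, 0]
single_preds = [0, 1, 1, 1]   # single-pass inference: 3 of 4 correct
tta_preds = [0, 0, 1, 1]      # TTA inference: 2 of 4 correct
use_tta = should_use_tta(single_preds, tta_preds, labels)
```

In this toy case TTA lowers validation accuracy, so the check returns `False` and single-pass inference would be kept.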
Abstract
Test-time augmentation (TTA)--aggregating predictions over multiple augmented copies of a test input--is widely assumed to improve classification accuracy, particularly in medical imaging where it is routinely deployed in production systems and competition solutions. We present a systematic empirical study challenging this assumption across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M). Our principal finding is that TTA with standard augmentation pipelines consistently degrades accuracy relative to single-pass inference, with drops as severe as 31.6 percentage points for ResNet-18 on pathology images. This degradation affects all architectures, including convolutional models, and worsens with more augmented views. The sole exception is ResNet-18 on dermatology images, which gains a modest +1.6%. We identify the…
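To make the definition of TTA in the abstract concrete, here is a minimal sketch of prediction aggregation over augmented views. It rests on several assumptions that do not come from the paper: aggregation is done by averaging class probabilities, the augmentations are simple flips, images are nested lists, and `toy_model`, `hflip`, `vflip`, and `tta_predict` are hypothetical names chosen for illustration. The `include_original` flag mirrors the finding that including the original image among the views matters.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]   # a grayscale image as rows of pixel values
Probs = List[float]         # one probability per class


def hflip(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vflip(image: Image) -> Image:
    """Mirror the image top to bottom."""
    return image[::-1]


def tta_predict(model: Callable[[Image], Probs],
                image: Image,
                augmentations: Sequence[Callable[[Image], Image]],
                include_original: bool = True) -> Probs:
    """Average the model's class probabilities over augmented views.

    Averaging is an assumed aggregation rule; other rules (voting, etc.) exist.
    """
    views = [aug(image) for aug in augmentations]
    if include_original:
        views.insert(0, image)
    if not views:
        raise ValueError("need at least one view to predict on")
    all_probs = [model(view) for view in views]
    n_classes = len(all_probs[0])
    return [sum(p[c] for p in all_probs) / len(all_probs)
            for c in range(n_classes)]


def toy_model(image: Image) -> Probs:
    """Hypothetical stand-in classifier: class-1 probability is the mean
    of the left column, so it is deliberately sensitive to horizontal flips."""
    left = sum(row[0] for row in image) / len(image)
    return [1.0 - left, left]


image = [[1.0, 0.0],
         [1.0, 0.0]]
single_pass = toy_model(image)                      # confident in class 1
with_tta = tta_predict(toy_model, image, [hflip])   # original + flipped view
```

Because the toy classifier is not flip-invariant, averaging over the flipped view dilutes a confident single-pass prediction. This illustrates only the mechanics of aggregation and is not a reproduction of the paper's results.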
