Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs
Finn Behrendt, Debayan Bhattacharya, Julia Krüger, Roland Opfer, Alexander Schlaefer

TL;DR
This paper compares the performance of Vision Transformers and CNNs for multi-label disease classification on chest radiographs, highlighting data efficiency and the advantages of DeiT variants with larger datasets.
Contribution
It systematically evaluates ViTs and CNNs on chest X-ray classification, demonstrating DeiT's superior data efficiency and performance with larger datasets.
Findings
ViTs perform comparably to CNNs on small datasets.
DeiT variants outperform ViTs with larger datasets.
Data efficiency of ViTs improves with dataset size.
Abstract
Radiographs are a versatile diagnostic tool for the detection and assessment of pathologies, for treatment planning or for navigation and localization purposes in clinical interventions. However, their interpretation and assessment by radiologists can be tedious and error-prone. Thus, a wide variety of deep learning methods have been proposed to support radiologists in interpreting radiographs. Mostly, these approaches rely on convolutional neural networks (CNNs) to extract features from images. Especially for the multi-label classification of pathologies on chest radiographs (Chest X-Rays, CXR), CNNs have proven to be well suited. On the contrary, Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images and interpretable local saliency maps which could add value to clinical interventions. ViTs do not rely on convolutions…
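To make the term "multi-label classification" in the abstract concrete, here is a minimal sketch of how such a task is commonly set up: each pathology gets its own sigmoid output, so several labels can be positive for one image, and the loss is binary cross-entropy averaged over labels. This illustrates the general technique only, not the authors' implementation. The label names, logits, and the 0.5 threshold are hypothetical values chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent labels.

    Each label is treated as its own yes/no problem, so the
    probabilities are NOT forced to sum to 1 (unlike softmax).
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(logits)

def predict(logits, threshold=0.5):
    """Mark a label as present when its probability reaches the threshold."""
    return [int(sigmoid(z) >= threshold) for z in logits]

# Hypothetical example: three pathologies, one image.
LABELS = ["atelectasis", "cardiomegaly", "effusion"]  # assumed label names
logits = [2.0, -1.5, 0.3]   # assumed raw outputs of some backbone (CNN or ViT)
targets = [1, 0, 1]         # assumed ground truth: two findings present

loss = multilabel_bce(logits, targets)
predicted = predict(logits)
present = [name for name, flag in zip(LABELS, predicted) if flag]
print(f"loss={loss:.4f}, predicted findings={present}")
```

The same head can sit on top of any feature extractor, which is what makes a like-for-like comparison of CNN and ViT backbones on this task possible.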
