Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks
Jue Jiang, Harini Veeraraghavan

TL;DR
This study benchmarks various self-supervised learning methods for 3D medical image segmentation, highlighting SMIT's superior accuracy, efficiency, and transferability across modalities and data sizes.
Contribution
It provides a comprehensive comparison of nine SSL methods on multiple medical segmentation tasks, emphasizing the effectiveness of MIM-based approaches like SMIT.
Findings
SMIT achieved the highest accuracy and the fastest convergence.
MIM-based methods outperformed contrastive and rotation-prediction methods.
SSL method choice impacts performance most in few-shot scenarios.
Abstract
Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89 M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using the Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine-tuning were further analyzed using centered kernel alignment. Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and…
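For readers unfamiliar with the two metrics named in the abstract, the following is a minimal NumPy sketch of the Dice similarity coefficient and of centered kernel alignment (CKA). It is not the authors' evaluation code. The function names `dice_coefficient` and `linear_cka` are hypothetical, the linear-kernel form of CKA is an assumption (the abstract does not say which kernel was used), and returning 1.0 when both masks are empty is one common convention rather than something stated in the source.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denom


def linear_cka(x, y):
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Rows are the same n samples passed through two networks or layers.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)


if __name__ == "__main__":
    # Toy masks: one overlapping voxel, 2 predicted and 1 reference voxel.
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))

    # Toy features: CKA of a representation with a rescaled copy of itself.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(50, 8))
    print(linear_cka(feats, 2.0 * feats))
```

In a feature-reuse analysis of the kind the abstract describes, `x` and `y` would plausibly be flattened activations of the same layer before and after fine-tuning, with values near 1 suggesting that the pretrained features were largely retained.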
