TL;DR
This paper introduces SMIT, a self-supervised learning (SSL) method that combines masked image modeling with self-distillation to pre-train vision transformers for 3D medical image segmentation, lowering the amount of fine-tuning data needed while improving accuracy.
Contribution
The paper proposes SMIT, an SSL approach for pre-training vision transformers for 3D multi-organ segmentation from CT and MRI. Its pretext task combines masked image prediction, a dense pixel-wise regression within masked patches, with masked patch token distillation.
Findings
Achieved an average Dice similarity coefficient (DSC) of 0.875 on MRI and 0.878 on CT.
Required fewer fine-tuning datasets than other methods.
Validated across multiple organs and imaging modalities.
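For readers unfamiliar with the metric, the sketch below shows how a Dice similarity coefficient is typically computed for one organ and averaged over several, using the standard definition DSC = 2|A ∩ B| / (|A| + |B|). The function names, the label encoding (0 as background), and the treatment of empty masks are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / denom)


def mean_dice(pred_labels, target_labels, num_classes):
    """Average DSC over foreground classes 1..num_classes-1 (0 is background)."""
    scores = [
        dice_coefficient(pred_labels == c, target_labels == c)
        for c in range(1, num_classes)
    ]
    return float(np.mean(scores))
```

Averaging per-organ scores like this is one common way to arrive at a single figure per modality; the paper may aggregate differently.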
Abstract
Vision transformers, with their ability to more efficiently model long-range context, have demonstrated impressive accuracy gains in several computer vision and medical image analysis tasks including segmentation. However, such methods need large labeled datasets for training, which are hard to obtain for medical image analysis. Self-supervised learning (SSL) has demonstrated success in medical image segmentation using convolutional networks. In this work, we developed a self-distillation learning with masked image modeling method to perform SSL for vision transformers (SMIT) applied to 3D multi-organ segmentation from CT and MRI. Our contribution is a dense pixel-wise regression within masked patches called masked image prediction, which we combined with masked patch token distillation as pretext task to pre-train vision transformers. We…
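The two pretext losses named in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the L1 form of the regression loss, the cross-entropy form of the token distillation loss, the temperature values, and all function names are assumptions. In self-distillation schemes the teacher is commonly a moving average of the student, but the text above does not say how SMIT builds its teacher. The sketch also assumes each volume dimension is divisible by the patch size and that at least one patch is masked.

```python
import numpy as np


def random_patch_mask(volume_shape, patch_size, mask_ratio, rng):
    """Pick whole 3D patches to mask; return patch-level and voxel-level masks."""
    grid = tuple(s // patch_size for s in volume_shape)
    num_patches = int(np.prod(grid))
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    patch_mask = flat.reshape(grid)
    # Expand each patch flag so it covers patch_size^3 voxels.
    voxel_mask = patch_mask
    for axis in range(3):
        voxel_mask = np.repeat(voxel_mask, patch_size, axis=axis)
    return patch_mask, voxel_mask


def masked_l1_loss(reconstruction, original, voxel_mask):
    """Dense regression loss: mean absolute error over masked voxels only."""
    return float(np.abs(reconstruction - original)[voxel_mask].mean())


def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def token_distillation_loss(student_logits, teacher_logits, patch_mask,
                            t_student=0.1, t_teacher=0.04):
    """Cross-entropy from teacher to student distributions on masked tokens.

    Logits have shape (num_patches, dim); the temperatures are illustrative.
    """
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    per_token = -(p_teacher * log_p_student).sum(axis=-1)
    return float(per_token[patch_mask.ravel()].mean())
```

In a real pipeline the reconstruction and the token logits would come from the transformer's decoder and projection heads, and the two losses would be summed, possibly with weights, before backpropagation through the student.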
