TL;DR
RD-ViT introduces a recurrent-depth architecture for vision transformers that reduces data dependence and enhances efficiency in dense prediction tasks like cardiac MRI segmentation.
Contribution
It adapts the Recurrent-Depth Transformer architecture to dense prediction, incorporating state injection, adaptive computation, and mixture-of-experts for improved performance and efficiency.
Findings
Outperforms standard ViT with only 10% training data
Achieves near-standard ViT performance with fewer parameters in 3D
Experts specialize for different cardiac structures without supervision
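The parameter saving behind the second finding can be illustrated with rough arithmetic. The sketch below is not taken from the paper: the width (384), depth (12), LoRA rank (8), the choice of one adapted d x d projection per loop step, and the helper names `block_params` and `lora_params` are all assumptions picked for illustration, and MoE experts are left out.

```python
# Rough parameter count: a stack of unique blocks vs. one shared block
# plus a small per-step LoRA adapter. All sizes here are illustrative
# assumptions, not values reported for RD-ViT.

def block_params(d: int, mlp_ratio: int = 4) -> int:
    """Parameters of one standard pre-norm transformer block with biases."""
    attention = 4 * d * d + 4 * d                           # Q, K, V, output projections
    mlp = 2 * d * (mlp_ratio * d) + (mlp_ratio * d) + d     # two linear layers
    layer_norms = 2 * 2 * d                                 # two LayerNorms (scale + shift)
    return attention + mlp + layer_norms

def lora_params(d: int, rank: int) -> int:
    """Low-rank update A (d x r) and B (r x d) on a single d x d projection."""
    return 2 * d * rank

d, depth, rank = 384, 12, 8            # assumed sizes
stacked = depth * block_params(d)      # every layer has its own weights
shared = block_params(d) + depth * lora_params(d, rank)  # one block + per-step adapters

print(f"stacked blocks : {stacked:,}")
print(f"shared + LoRA  : {shared:,}")
print(f"ratio          : {stacked / shared:.1f}x")
```

Under these assumptions the per-step adapters add only a few percent on top of the single shared block, so the total stays close to one block's worth of weights regardless of how many times the block is looped.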
Abstract
Vision Transformers (ViTs) achieve state-of-the-art segmentation accuracy but require large training datasets because each layer has unique parameters that must be learned independently. We present RD-ViT, a Recurrent-Depth Vision Transformer that adapts the Recurrent-Depth Transformer (RDT) architecture to dense prediction tasks, supporting both 2D and 3D inputs. RD-ViT replaces the deep stack of unique transformer blocks with a single shared block looped T times, augmented with LTI-stable state injection for guaranteed convergence, Adaptive Computation Time (ACT) for spatial compute allocation, depth-wise LoRA adaptation, and optional Mixture-of-Experts (MoE) feed-forward networks for category-specific specialization. We evaluate on the ACDC cardiac MRI segmentation benchmark in both 2D slice-level and 3D volumetric settings with exclusively real experiments executed in Google Colab.…
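The loop described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the shared block is replaced by a small tanh MLP, the state injection is assumed to take the form h ← a ⊙ h + b ⊙ e + f(h) with each entry of a squashed into (0, 1) so that the linear part of the update is a contraction, and the halting rule follows the generic Adaptive Computation Time recipe. LoRA and MoE are omitted, and every name (`rd_forward`, `a_logit`, `w_halt`, and so on) is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_block(h, W1, W2):
    """Stand-in for the single weight-tied transformer block (assumption)."""
    return np.tanh(h @ W1) @ W2

def rd_forward(e, p, T=8, eps=0.01):
    """Loop one shared block up to T times with per-token ACT-style halting.

    e: (tokens, dim) patch embeddings, re-injected at every step.
    Returns the halting-weighted output, steps used per token, and the
    total halting weight per token (which should sum to 1).
    """
    n, _ = e.shape
    a = sigmoid(p["a_logit"])            # decay in (0, 1): stable linear part
    h = np.zeros_like(e)
    out = np.zeros_like(e)
    cum = np.zeros(n)                    # cumulative halting probability
    steps = np.zeros(n, dtype=int)
    for t in range(T):
        active = cum < 1.0 - eps
        if not active.any():
            break
        h_new = a * h + p["b"] * e + shared_block(h, p["W1"], p["W2"])
        halt = sigmoid(h_new @ p["w_halt"])
        # Tokens that cross the threshold, or reach the last step, spend
        # their remaining probability mass and stop updating.
        last = (t == T - 1) | (cum + halt >= 1.0 - eps)
        weight = np.where(last, 1.0 - cum, halt) * active
        out += weight[:, None] * h_new
        cum += weight
        h = np.where(active[:, None], h_new, h)
        steps += active
    return out, steps, cum

rng = np.random.default_rng(0)
dim, hidden, tokens = 32, 64, 16         # illustrative sizes
params = {
    "a_logit": rng.normal(size=dim),
    "b": np.ones(dim),
    "W1": rng.normal(0.0, 0.1, (dim, hidden)),
    "W2": rng.normal(0.0, 0.1, (hidden, dim)),
    "w_halt": rng.normal(0.0, 0.5, dim),
}
e = rng.normal(size=(tokens, dim))
out, steps, cum = rd_forward(e, params)
print("steps per token:", steps)
```

Because each token halts independently, `steps` varies across tokens, which is the sense in which ACT allocates compute spatially. Note that constraining `a` only bounds the linear part of the recurrence; whether the full nonlinear update converges depends on the shared block, which this sketch does not control.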
