RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence
Extending the Recurrent-Depth Transformer Architecture to Dense Prediction

Renjie He

arXiv:2605.03999·cs.CV·May 6, 2026

TL;DR

RD-ViT introduces a recurrent-depth architecture for vision transformers that reduces data dependence and enhances efficiency in dense prediction tasks like cardiac MRI segmentation.

Contribution

It adapts the Recurrent-Depth Transformer architecture to dense prediction, incorporating state injection, adaptive computation, and mixture-of-experts for improved performance and efficiency.
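As a rough illustration of what per-token adaptive computation can look like, here is a minimal NumPy sketch of Graves-style halting over a looped update. The function names (`act_loop`, `step_fn`, `halt_fn`), the threshold `eps`, and the toy inputs are hypothetical and are not taken from the paper; the paper's actual halting rule may differ.

```python
import numpy as np

def act_loop(h, step_fn, halt_fn, max_steps=8, eps=0.01):
    """Per-token adaptive halting (a sketch, not the paper's implementation).

    h:       (N, D) token states
    step_fn: maps (N, D) -> (N, D), one application of the looped update
    halt_fn: maps (N, D) -> (N,), halting probability in (0, 1) per token
    Returns the halting-weighted output (N, D) and steps used per token (N,).
    """
    n = h.shape[0]
    cum = np.zeros(n)                  # accumulated halting probability
    out = np.zeros_like(h)             # weighted mixture of visited states
    steps = np.zeros(n, dtype=int)     # iterations consumed by each token
    running = np.ones(n, dtype=bool)   # tokens that have not halted yet
    for t in range(max_steps):
        h = step_fn(h)
        p = halt_fn(h)
        last = t == max_steps - 1
        # A token halts once its budget is (almost) spent, or at the final step.
        halting = running & ((cum + p >= 1.0 - eps) | last)
        # Halting tokens receive the remainder so each token's weights sum to 1.
        w = np.where(halting, 1.0 - cum, p) * running
        out += w[:, None] * h
        cum += w
        steps += running
        running &= ~halting
        if not running.any():
            break
    return out, steps

# Toy usage: the first token halts early, the second runs to the step limit.
out, steps = act_loop(
    np.zeros((2, 4)),
    step_fn=lambda h: h + 1.0,
    halt_fn=lambda h: np.array([0.5, 0.1]),
)
```

For simplicity this sketch still applies `step_fn` to halted tokens and merely gives them zero weight; an implementation that wants real compute savings would mask or gather the active tokens instead.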

Findings

01 Outperforms standard ViT with only 10% training data

02 Achieves near-standard ViT performance with fewer parameters in 3D

03 Experts specialize for different cardiac structures without supervision

Abstract

Vision Transformers (ViTs) achieve state-of-the-art segmentation accuracy but require large training datasets because each layer has unique parameters that must be learned independently. We present RD-ViT, a Recurrent-Depth Vision Transformer that adapts the Recurrent-Depth Transformer (RDT) architecture to dense prediction tasks, supporting both 2D and 3D inputs. RD-ViT replaces the deep stack of unique transformer blocks with a single shared block looped T times, augmented with LTI-stable state injection for guaranteed convergence, Adaptive Computation Time (ACT) for spatial compute allocation, depth-wise LoRA adaptation, and optional Mixture-of-Experts (MoE) feed-forward networks for category-specific specialization. We evaluate on the ACDC cardiac MRI segmentation benchmark in both 2D slice-level and 3D volumetric settings with exclusively real experiments executed in Google Colab.…
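To make the weight-sharing idea concrete, the following NumPy sketch loops one shared block T times and injects the input embedding at every step. Everything here is an assumption for illustration: the abstract does not give the update rule, so the form `h = a * h + b * e + block(h)` with a diagonal `a` constrained to (0, 1) is only one plausible reading of "LTI-stable state injection", and a small MLP stands in for the transformer block. ACT, LoRA, and MoE are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T = 16, 32, 6  # tokens, width, loop count (illustrative values)

# Hypothetical stand-in for the single shared transformer block.
W1 = rng.normal(scale=0.1, size=(D, 4 * D))
W2 = rng.normal(scale=0.1, size=(4 * D, D))

def shared_block(h):
    return np.tanh(h @ W1) @ W2

# Assumed injection parameters: a sigmoid keeps every entry of the diagonal
# state matrix strictly inside (0, 1), so the linear part is a contraction.
a = 1.0 / (1.0 + np.exp(-rng.normal(size=D)))
b = rng.normal(scale=0.1, size=D)

def rd_forward(e, T, block=shared_block):
    """Apply the same block T times, re-injecting the embedding e each step."""
    h = np.zeros_like(e)
    for _ in range(T):
        h = a * h + b * e + block(h)
    return h

e = rng.normal(size=(N, D))  # stand-in for patch embeddings
h_T = rd_forward(e, T)

# Parameter count: one shared block (plus injection) vs. T unique blocks.
block_params = W1.size + W2.size
shared_params = block_params + a.size + b.size
unique_params = T * block_params
```

In this sketch the stability constraint covers only the linear recurrence: with the block removed, the iteration converges to `b * e / (1 - a)`. Whether and how the paper's guarantee extends to the full nonlinear update is not stated in the abstract.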
