LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
Boyang Shen, Kaixiang Yang, Hao Wang, Qiuyu Yu, Qiang Xie, Qiang Li, Zhiwei Wang

TL;DR
LoopVLA introduces a recurrent architecture that learns when to stop refining representations in vision-language-action tasks, improving efficiency and performance in robotic manipulation.
Contribution
It proposes a novel self-supervised sufficiency estimation method within a recurrent VLA model, enabling adaptive refinement and reducing computation.
Findings
Reduces the model's parameter count by 45%.
Increases inference throughput by up to 1.7 times.
Matches or outperforms strong baselines in task success.
Abstract
Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a…
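The recurrent refinement described in the abstract can be illustrated with a toy loop: one shared block is applied repeatedly, and after each pass an action readout and a sufficiency score are computed. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `shared_block`, `action_head`, `sufficiency_head`, and `refine_until_sufficient`, the scalar "tokens", the fixed weights, and the threshold-based halting rule are all hypothetical stand-ins; in LoopVLA the block is a Transformer and the sufficiency estimate is learned.

```python
import math
from typing import List, Tuple

Tokens = List[float]


def shared_block(tokens: Tokens) -> Tokens:
    """Toy stand-in for the shared block: the same update is reused at every
    iteration (here, each value moves halfway toward 1.0)."""
    return [t + 0.5 * (1.0 - t) for t in tokens]


def action_head(tokens: Tokens) -> float:
    """Toy action readout (assumption): the mean of the refined tokens."""
    return sum(tokens) / len(tokens)


def sufficiency_head(tokens: Tokens) -> float:
    """Toy sufficiency score in (0, 1) (assumption): a sigmoid over the token
    mean with hand-picked weights, standing in for a learned estimator."""
    mean = sum(tokens) / len(tokens)
    return 1.0 / (1.0 + math.exp(-(8.0 * mean - 6.0)))


def refine_until_sufficient(
    tokens: Tokens, max_iters: int, threshold: float
) -> Tuple[float, int]:
    """Apply the shared block until the sufficiency score reaches the
    threshold or the iteration budget runs out. Returns (action, steps)."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    action, steps = 0.0, 0
    for steps in range(1, max_iters + 1):
        tokens = shared_block(tokens)      # same weights every pass
        action = action_head(tokens)       # an action is available at each pass
        if sufficiency_head(tokens) >= threshold:
            break                          # representation judged sufficient
    return action, steps


if __name__ == "__main__":
    action, steps = refine_until_sufficient([0.0, 0.0, 0.0], max_iters=8, threshold=0.7)
    print(f"stopped after {steps} iterations, action={action}")
```

With these toy settings the loop halts after three of the eight allowed iterations, which is the kind of saving that adaptive halting is meant to provide. The exit threshold and the iteration budget are illustrative knobs only; the source does not specify how LoopVLA sets them.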
