LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

Boyang Shen; Kaixiang Yang; Hao Wang; Qiuyu Yu; Qiang Xie; Qiang Li; Zhiwei Wang

arXiv:2605.09948 · cs.AI · May 12, 2026

TL;DR

LoopVLA introduces a recurrent architecture that learns when to stop refining representations in vision-language-action tasks, improving efficiency and performance in robotic manipulation.

Contribution

The paper proposes a self-supervised sufficiency estimation method within a recurrent VLA model, enabling adaptive refinement and reducing computation.

Findings

01. Reduces model parameters by 45%.

02. Increases inference throughput by up to 1.7×.

03. Matches or outperforms strong baselines in task success.

Abstract

Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a…
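
The control flow described above (one shared block applied repeatedly, with refinement stopping once the representation is judged sufficient) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `shared_block`, `action_head`, `sufficiency_head`, `refine_until_sufficient`, the threshold, and the iteration cap are hypothetical names and values, the "tokens" are plain floats, and the sufficiency score is a fixed formula rather than the learned, self-supervised estimate the paper proposes.

```python
import math


def shared_block(tokens):
    """Hypothetical stand-in for the shared Transformer block.

    The same function (same parameters) is reused at every iteration;
    here it simply pulls each token halfway toward the token mean.
    """
    mean = sum(tokens) / len(tokens)
    return [t + 0.5 * (mean - t) for t in tokens]


def action_head(tokens):
    """Hypothetical action head: pools the tokens into one scalar action."""
    return sum(tokens) / len(tokens)


def sufficiency_head(tokens):
    """Hypothetical sufficiency head returning a score in (0, 1).

    Assumption: a fixed formula on the token spread. In the paper this
    estimate is learned, not hand-written.
    """
    spread = max(tokens) - min(tokens)
    return 1.0 / (1.0 + math.exp(4.0 * spread - 2.0))


def refine_until_sufficient(tokens, threshold=0.8, max_iters=8):
    """Apply the shared block repeatedly; exit once the score clears the threshold."""
    action = action_head(tokens)
    for step in range(1, max_iters + 1):
        tokens = shared_block(tokens)
        action = action_head(tokens)
        if sufficiency_head(tokens) >= threshold:
            return action, step
    # Iteration budget exhausted: return the last action anyway.
    return action, max_iters


easy_action, easy_steps = refine_until_sufficient([0.1, 0.2, 0.3])
hard_action, hard_steps = refine_until_sufficient([-4.0, 0.0, 4.0])
```

In this toy setting the tightly clustered input exits after one iteration and the widely spread input after six, which is the intended point: compute per step varies with the input instead of being fixed by network depth.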
