Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Yalcin Tur; Jalal Naghiyev; Haoquan Fang; Wei-Chuan Tsai; Jiafei Duan; Dieter Fox; Ranjay Krishna

arXiv:2602.07845 · cs.RO · February 10, 2026

Open Access

TL;DR

RD-VLA introduces a recurrent architecture for vision-language-action models that adaptively scales compute at test time through latent iterative refinement, enabling efficient and scalable reasoning in robotics tasks.

Contribution

The paper proposes a recurrent, weight-tied action head that supports variable inference depth with a constant memory footprint, replacing token-based reasoning with latent iterative refinement.
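To make the weight-tied idea concrete, here is a toy sketch rather than the authors' model. It assumes a hypothetical one-parameter update (`refine_step`, a `tanh` mix of the latent and a conditioning vector) and a hypothetical wrapper `recurrent_head`. The point it illustrates is only that one set of parameters is reused at every step and that the loop holds a single latent, so memory does not grow with depth.

```python
import math


def refine_step(state, context, weight=0.5):
    """One refinement step. `weight` stands in for the shared (tied) parameters;
    the tanh form is an assumption chosen for illustration only."""
    return [math.tanh(weight * s + c) for s, c in zip(state, context)]


def recurrent_head(context, depth):
    """Apply the same step `depth` times, keeping only the current latent."""
    state = [0.0] * len(context)  # initial latent (assumed zero here)
    for _ in range(depth):
        # Overwrite the latent in place of storing a per-step history,
        # so the memory used is independent of `depth`.
        state = refine_step(state, context)
    return state


if __name__ == "__main__":
    context = [0.3, -0.7]  # hypothetical conditioning features
    for depth in (1, 4, 16):
        print(depth, recurrent_head(context, depth))
```

Because the step is the same function at every iteration, the depth can be chosen at inference time without adding parameters. In this toy setting the outputs at larger depths settle toward a fixed point.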

Findings

01 Recurrent depth improves success rates from 0% to over 90% in complex tasks.

02 Constant memory usage enables up to 80x faster inference.

03 Adaptive stopping criterion effectively allocates compute based on task complexity.

Abstract

Current Vision-Language-Action (VLA) models rely on fixed computational depth, expending the same amount of compute on simple adjustments and complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, it scales memory linearly and is ill-suited for continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity via latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained using truncated backpropagation through time (TBPTT) to efficiently supervise the refinement process. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks…
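The adaptive stopping idea in the abstract can be sketched as a loop that halts once successive latents stop changing. This is a minimal illustration under stated assumptions, not the paper's criterion: the update rule `refine_step`, the Euclidean-distance test, and the names `tol` and `max_steps` are all hypothetical choices made for the example.

```python
import math


def refine_step(state, context, weight=0.5):
    """Toy shared-parameter update (assumed form, for illustration only)."""
    return [math.tanh(weight * s + c) for s, c in zip(state, context)]


def refine_until_converged(context, tol=1e-4, max_steps=64):
    """Iterate the shared step until the latent moves less than `tol`.

    Returns the final latent and the number of steps actually used.
    """
    state = [0.0] * len(context)
    for step in range(1, max_steps + 1):
        new_state = refine_step(state, context)
        # Convergence signal: Euclidean distance between successive latents.
        delta = math.sqrt(sum((n - s) ** 2 for n, s in zip(new_state, state)))
        state = new_state
        if delta < tol:
            return state, step
    return state, max_steps  # compute budget exhausted without meeting `tol`


if __name__ == "__main__":
    for context in ([0.0, 0.0], [0.3, -0.7]):
        latent, steps = refine_until_converged(context)
        print(context, "->", steps, "steps")
```

In this sketch an input whose latent is already at the fixed point stops after a single step, while other inputs take more iterations, which is the sense in which compute is allocated per input. The `max_steps` cap bounds the worst case.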

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications