TL;DR
LaRA-VLA introduces a continuous latent reasoning framework for vision-language-action models, reducing inference latency and improving performance in embodied tasks by internalizing multi-modal reasoning.
Contribution
The paper proposes a unified latent reasoning approach that replaces explicit chain-of-thought generation at inference time, trained with a curriculum-based paradigm, for efficient real-time embodied control.
Findings
Outperforms state-of-the-art VLA methods on benchmarks
Reduces inference latency by up to 90% by internalizing reasoning instead of generating explicit CoT at inference time
Effective in long-horizon real-robot manipulation tasks
Abstract
Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that mismatch continuous perception and control. We propose Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multi-modal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts latent reasoning dynamics to condition action generation. We construct two structured CoT datasets and evaluate…
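The curriculum described in the abstract (explicit CoT supervision first, then latent reasoning, then adaptation for action generation) can be pictured as a step-indexed schedule of loss weights. The following is a minimal sketch only, not the authors' implementation: the stage names, step boundaries, weights, and the functions `stage_for_step` and `combined_loss` are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One curriculum stage. All fields are illustrative assumptions."""
    name: str
    end_step: int          # exclusive upper bound on the global training step
    w_explicit_cot: float  # weight on explicit textual/visual CoT supervision
    w_latent: float        # weight on the latent-reasoning objective
    w_action: float        # weight on the action-generation objective


# Hypothetical three-stage schedule; step counts and weights are placeholders,
# not values reported by the paper.
CURRICULUM = (
    Stage("explicit_cot", 1000, 1.0, 0.0, 0.0),
    Stage("latent_transition", 2000, 0.5, 0.5, 0.0),
    Stage("action_adaptation", 3000, 0.0, 0.5, 1.0),
)


def stage_for_step(step: int, curriculum=CURRICULUM) -> Stage:
    """Return the stage active at a given training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    for stage in curriculum:
        if step < stage.end_step:
            return stage
    # Past the last boundary, stay in the final stage.
    return curriculum[-1]


def combined_loss(step: int, explicit_cot_loss: float,
                  latent_loss: float, action_loss: float) -> float:
    """Weight the three per-objective losses according to the active stage."""
    s = stage_for_step(step)
    return (s.w_explicit_cot * explicit_cot_loss
            + s.w_latent * latent_loss
            + s.w_action * action_loss)
```

In this sketch the explicit CoT weight falls to zero by the last stage, mirroring the abstract's claim that no explicit CoT is generated at inference time; whether the actual training recipe uses hard stage boundaries or a smooth blend is not stated in the text above.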
