Recursive Belief Vision Language Action Models
Vaidehi Bagaria, Bijo Sebastian, Nirav Kumar Patel

TL;DR
This paper introduces RB-VLA, a belief-centric vision-language-action model that maintains persistent, action-conditioned state representations for long-horizon tasks, significantly improving success rates and reducing inference latency in complex manipulation scenarios.
Contribution
RB-VLA is the first belief-based architecture for vision-language-action models that effectively handles long-horizon, multi-stage tasks under partial observability.
Findings
Outperforms prior VLAs on long-horizon benchmarks with 52.5% and 37.5% higher success rates.
Reduces inference latency by up to five times compared to baselines.
Belief module is crucial, increasing success rates from 32.5% to 77.5%.
Abstract
Vision-language-action models must enable agents to execute long-horizon tasks under partial observability. However, most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. While semantic grounding is important, long-horizon manipulation fundamentally requires persistent, action-conditioned state representations. Current VLAs lack such representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once per task, the VLM…
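The abstract's central idea, a compact latent state updated recursively from the previous action and the new observation while the VLM is consulted only once, can be illustrated with a toy recurrence. This is a minimal sketch under stated assumptions, not the authors' architecture: the tanh update rule, the random untrained weights, the linear policy head, and every name below (`RecursiveBelief`, `run_episode`, `W_b`, `W_o`, `W_a`, `W_g`, `W_pi`) are hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random matrix; stands in for learned weights (assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class RecursiveBelief:
    """Toy belief state: b_t = tanh(W_b b_{t-1} + W_o o_t + W_a a_{t-1} + W_g g)."""

    def __init__(self, belief_dim, obs_dim, act_dim, goal_dim, seed=0):
        rng = random.Random(seed)
        self.W_b = rand_matrix(rng, belief_dim, belief_dim)
        self.W_o = rand_matrix(rng, belief_dim, obs_dim)
        self.W_a = rand_matrix(rng, belief_dim, act_dim)
        self.W_g = rand_matrix(rng, belief_dim, goal_dim)
        self.belief = [0.0] * belief_dim

    def update(self, obs, prev_action, goal):
        # The new belief depends on the old belief, the new observation,
        # the previous action, and the fixed goal embedding.
        terms = [
            matvec(self.W_b, self.belief),
            matvec(self.W_o, obs),
            matvec(self.W_a, prev_action),
            matvec(self.W_g, goal),
        ]
        self.belief = [math.tanh(sum(parts)) for parts in zip(*terms)]
        return self.belief


def run_episode(steps=5, obs_dim=4, act_dim=2, goal_dim=3, belief_dim=8):
    rng = random.Random(1)
    vlm_calls = 0

    # Stand-in for the single VLM query per task: a random goal embedding.
    goal = [rng.gauss(0.0, 1.0) for _ in range(goal_dim)]
    vlm_calls += 1

    model = RecursiveBelief(belief_dim, obs_dim, act_dim, goal_dim)
    W_pi = rand_matrix(rng, act_dim, belief_dim)  # hypothetical linear policy head
    action = [0.0] * act_dim
    belief = model.belief
    for _ in range(steps):
        obs = [rng.gauss(0.0, 1.0) for _ in range(obs_dim)]  # fake sensor reading
        belief = model.update(obs, action, goal)
        action = matvec(W_pi, belief)  # act from the belief, not the raw observation
    return belief, action, vlm_calls
```

In this sketch the policy reads only the belief vector, so two visually identical observations can still lead to different actions when the action history differs, which is the kind of perceptual aliasing the abstract describes. The goal embedding is computed once before the loop, so the per-step cost does not include a VLM call.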
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
