Recursive Belief Vision Language Action Models

Vaidehi Bagaria; Bijo Sebastian; Nirav Kumar Patel

arXiv:2602.20659·cs.AI·February 26, 2026

Open Access

TL;DR

This paper introduces RB-VLA, a belief-centric vision-language-action model that maintains persistent, action-conditioned state representations for long-horizon tasks, significantly improving success rates and reducing inference latency in complex manipulation scenarios.

Contribution

RB-VLA is the first belief-based architecture for vision-language-action models that effectively handles long-horizon, multi-stage tasks under partial observability.

Findings

01. Outperforms prior VLAs on long-horizon benchmarks, with success rates 52.5% and 37.5% higher.

02. Reduces inference latency by up to a factor of five compared with baselines.

03. The belief module is crucial: it raises success rates from 32.5% to 77.5%.

Abstract

Vision-language-action models must enable agents to execute long-horizon tasks under partial observability. However, most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. While semantic grounding is important, long-horizon manipulation fundamentally requires persistent, action-conditioned state representations. Current VLAs lack such representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once per task, the VLM…
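
To make the idea of a persistent, action-conditioned state concrete, here is a minimal sketch of a recursive belief update wrapped in a control loop that queries the VLM only once per task. It is not the RB-VLA implementation: the tanh update rule, the random weights standing in for learned parameters, the dimensions, and the names `BeliefFilter`, `run_episode`, `toy_vlm`, and `toy_policy` are all assumptions made for illustration. The excerpt does not say what the VLM returns, so a placeholder goal vector is used.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.5):
    """Small random weight matrix (a stand-in for learned parameters)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BeliefFilter:
    """Hypothetical update: b_t = tanh(W_b b_{t-1} + W_a a_{t-1} + W_o o_t)."""

    def __init__(self, belief_dim, action_dim, obs_dim, seed=0):
        rng = random.Random(seed)
        self.w_b = make_matrix(belief_dim, belief_dim, rng)
        self.w_a = make_matrix(belief_dim, action_dim, rng)
        self.w_o = make_matrix(belief_dim, obs_dim, rng)
        self.belief = [0.0] * belief_dim

    def update(self, prev_action, obs):
        # The new belief depends on the old belief, the last action, and the new observation.
        terms = zip(matvec(self.w_b, self.belief),
                    matvec(self.w_a, prev_action),
                    matvec(self.w_o, obs))
        self.belief = [math.tanh(b + a + o) for b, a, o in terms]
        return self.belief


def run_episode(belief_filter, policy, query_vlm, instruction, observations, action_dim):
    goal = query_vlm(instruction)  # single VLM query for the whole task
    prev_action = [0.0] * action_dim
    beliefs, actions = [], []
    for obs in observations:
        belief = belief_filter.update(prev_action, obs)
        prev_action = policy(belief, goal)
        beliefs.append(list(belief))
        actions.append(prev_action)
    return beliefs, actions


# Toy usage: the same observation is presented three times (perceptual aliasing).
vlm_calls = []


def toy_vlm(instruction):
    vlm_calls.append(instruction)
    return [1.0, -1.0]  # placeholder goal vector (assumption)


def toy_policy(belief, goal):
    strength = sum(belief)
    return [g * strength for g in goal]


same_view = [1.0, 0.5, -0.5]
beliefs, actions = run_episode(
    BeliefFilter(belief_dim=4, action_dim=2, obs_dim=3),
    toy_policy, toy_vlm, "stack the blocks", [same_view] * 3, action_dim=2,
)
```

Because the belief carries the previous belief and action forward, identical observations at different steps map to different internal states. A purely observation-driven policy cannot make that distinction, which is the action-repetition failure the abstract attributes to perceptual aliasing.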

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning