TL;DR
AVA-VLA adds a recurrent state and active visual attention to vision-language-action models for robotic tasks, reporting state-of-the-art results on LIBERO and CALVIN and effective transfer to real-world dual-arm manipulation.
Contribution
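The paper reformulates VLA policy learning as a Partially Observable Markov Decision Process (POMDP) and proposes AVA-VLA, which conditions action generation on a recurrent state and uses Active Visual Attention (AVA) to reweight visual tokens for temporally grounded visual processing.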
Findings
State-of-the-art performance on LIBERO and CALVIN benchmarks.
Effective transfer to real-world dual-arm manipulation tasks.
Improved handling of partial observability in robotic decision-making.
Abstract
Vision-Language-Action (VLA) models have shown remarkable progress in embodied tasks recently, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov Decision Process, even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state that serves as a neural approximation to the agent's belief over task history. Built on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights visual tokens in the current observation to focus on regions most relevant given both the instruction and execution history.…
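To make the idea in the abstract concrete, here is a minimal NumPy sketch of history-conditioned token reweighting. It is an illustration under stated assumptions, not the authors' implementation: the function `ava_step`, the additive combination of state and instruction, the softmax scoring, and the tanh state update with matrices `W_h` and `W_x` are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def ava_step(tokens, instruction, state, W_h, W_x):
    """One hypothetical step: reweight visual tokens, then update the state.

    tokens:      (N, d) visual token embeddings for the current observation
    instruction: (d,)   instruction embedding
    state:       (d,)   recurrent state summarizing execution history
    """
    d = tokens.shape[1]
    # Assumption: history and instruction are combined by simple addition.
    context = state + instruction
    # Relevance score of each token with respect to that context.
    scores = tokens @ context / np.sqrt(d)
    weights = softmax(scores)
    # Reweight tokens so that relevant regions contribute more.
    reweighted = tokens * weights[:, None]
    # Assumption: a plain tanh recurrence stands in for the state update.
    pooled = reweighted.sum(axis=0)
    new_state = np.tanh(W_h @ state + W_x @ pooled)
    return reweighted, weights, new_state

rng = np.random.default_rng(0)
N, d = 6, 8
W_h = rng.normal(scale=0.1, size=(d, d))
W_x = rng.normal(scale=0.1, size=(d, d))
instruction = rng.normal(size=d)
state = np.zeros(d)

# Roll the state forward over a few timesteps of random "observations".
for t in range(3):
    tokens = rng.normal(size=(N, d))
    reweighted, weights, state = ava_step(tokens, instruction, state, W_h, W_x)
```

Because `state` is carried across iterations, the token weights at each step depend on earlier observations as well as the instruction, which is the contrast with per-timestep, history-agnostic processing that the abstract draws.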
