AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Lei Xiao; Jifeng Li; Juntao Gao; Feiyang Ye; Yan Jin; Jingjing Qian; Jing Zhang; Yong Wu; Xiaoyuan Yu

arXiv:2511.18960 · cs.LG · April 13, 2026

TL;DR

AVA-VLA introduces a recurrent state and active visual attention to improve vision-language-action models for robotic tasks, achieving state-of-the-art results and better real-world transfer.

Contribution

The paper reformulates VLA policy learning as a Partially Observable Markov Decision Process (POMDP) and proposes AVA-VLA, which uses active visual attention for temporally grounded visual processing.
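One way to write down the contrast between the history-agnostic view and the POMDP view is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $o_t$ is the current observation, $\ell$ the language instruction, $a_t$ the action, $h_t$ the recurrent state, and $f_\theta$ a hypothetical state-update function whose exact inputs the summary above does not specify.

```latex
\begin{align*}
  % History-agnostic policy: acts on the current observation only
  a_t &\sim \pi_\theta(a_t \mid o_t, \ell) \\
  % POMDP-style policy: also conditioned on a recurrent state
  % that approximates the belief over task history
  a_t &\sim \pi_\theta(a_t \mid o_t, \ell, h_{t-1}),
  \qquad h_t = f_\theta(h_{t-1}, o_t, a_t)
\end{align*}
```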

Findings

01. State-of-the-art performance on LIBERO and CALVIN benchmarks.

02. Effective transfer to real-world dual-arm manipulation tasks.

03. Improved handling of partial observability in robotic decision-making.

Abstract

Vision-Language-Action (VLA) models have recently shown remarkable progress in embodied tasks, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov Decision Process, even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state that serves as a neural approximation to the agent's belief over task history. Built on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights visual tokens in the current observation to focus on regions most relevant given both the instruction and execution history.…
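The abstract does not say how the reweighting or the recurrent state is computed, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `active_visual_attention`, `update_state`, and `rollout`, the sigmoid gating, the mean-pooled tanh state update, and the random weights and toy dimensions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def active_visual_attention(tokens, state, instr, W_q):
    """Reweight visual tokens by relevance to the recurrent state and instruction.

    tokens: (N, D) visual tokens of the current observation
    state:  (H,)   recurrent state from the previous step
    instr:  (L,)   instruction embedding
    W_q:    (H + L, D) hypothetical projection into token space
    """
    query = np.concatenate([state, instr]) @ W_q          # (D,)
    scores = tokens @ query / np.sqrt(tokens.shape[1])    # (N,)
    weights = sigmoid(scores)                             # per-token gate in (0, 1)
    return tokens * weights[:, None], weights


def update_state(state, weighted_tokens, W_h, W_x):
    """Hypothetical recurrent update from the mean-pooled, reweighted tokens."""
    pooled = weighted_tokens.mean(axis=0)                 # (D,)
    return np.tanh(state @ W_h + pooled @ W_x)            # (H,)


def rollout(steps=3, N=16, D=8, H=4, L=6, seed=0):
    """Run a few timesteps on random data to show how the state carries over."""
    rng = np.random.default_rng(seed)
    W_q = rng.normal(size=(H + L, D)) * 0.1
    W_h = rng.normal(size=(H, H)) * 0.1
    W_x = rng.normal(size=(D, H)) * 0.1
    instr = rng.normal(size=L)
    state = np.zeros(H)                                   # no history at t = 0
    for _ in range(steps):
        tokens = rng.normal(size=(N, D))                  # stand-in for encoder output
        weighted, weights = active_visual_attention(tokens, state, instr, W_q)
        state = update_state(state, weighted, W_h, W_x)
    return weighted, weights, state


weighted, weights, state = rollout()
print(weighted.shape, weights.shape, state.shape)
```

In this sketch the token weights at each step depend on the state produced by earlier steps, which is the property the abstract attributes to AVA; a history-agnostic model would compute them from the current tokens and instruction alone.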

Code & Models

Repositories

https://liauto-dsr.github.io/AVA-VLA-Page
