ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver

Wenxuan Song; Ziyang Zhou; Han Zhao; Jiayi Chen; Pengxiang Ding; Haodong Yan; Yuxin Huang; Feilong Tang; Donglin Wang; Haoang Li

arXiv:2508.10333·cs.RO·August 15, 2025

TL;DR

ReconVLA is a reconstructive vision-language-action model that improves visual attention grounding and manipulation precision in robots. A diffusion transformer reconstructs the gaze region of the input image, and the model is pretrained on a large-scale dataset.

Contribution

The paper presents ReconVLA, a novel reconstructive VLA model with implicit grounding, and a large-scale pretraining dataset to enhance visual attention and manipulation accuracy in robotic agents.

Findings

01. ReconVLA outperforms existing models in simulation and real-world tasks.

02. Implicit grounding improves visual attention allocation.

03. Large-scale pretraining enhances generalization and reconstruction quality.

Abstract

Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions; instead, their visual attention is consistently dispersed. To ground visual attention on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer reconstructs the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and allocate visual attention accurately, so that it can exploit task-specific visual information and manipulate objects precisely. Moreover, we curate a large-scale pretraining dataset…
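The reconstruction objective described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not give the loss, the noise schedule, or the way the gaze region is selected, so the bounding-box crop, the DDPM-style noising step, and the mean-squared-error loss below are generic stand-ins, and the function names are hypothetical. In the paper's setup a diffusion transformer conditioned on the VLA's visual outputs would play the role of the denoiser; here a placeholder prediction takes its place.

```python
import math
import random


def crop_gaze_region(image, box):
    """Return the sub-image inside box = (top, left, bottom, right), end-exclusive.

    Assumption: the gaze region is given as a bounding box around the target object.
    """
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def add_noise(target, alpha_bar, rng):
    """Generic DDPM-style forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]
    noisy = [
        [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
         for x, e in zip(row, noise_row)]
        for row, noise_row in zip(target, noise)
    ]
    return noisy, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise (assumed loss form)."""
    diffs = [
        (p - t) ** 2
        for p_row, t_row in zip(predicted_noise, true_noise)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 4x4 single-channel "image"; the box marks a hypothetical target object.
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    gaze = crop_gaze_region(image, (1, 1, 3, 3))
    noisy, noise = add_noise(gaze, alpha_bar=0.5, rng=rng)
    # Placeholder for the denoiser: predicts zero noise everywhere.
    predicted = [[0.0 for _ in row] for row in noisy]
    print("reconstruction loss:", denoising_loss(predicted, noise))
```

In a training loop, the gradient of such a loss would flow back through the conditioning tokens into the VLA backbone, which is one plausible reading of how reconstruction could encourage attention on the target region.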

Code & Models

Models

🤗 zzyzyzy/ReconVLA (model)

Videos

ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver