TL;DR
ReconVLA is a reconstructive vision-language-action model that improves visual attention grounding and manipulation precision in robots. A diffusion transformer reconstructs the gaze region of the input image, and the model is pretrained on a large-scale dataset.
Contribution
The paper presents ReconVLA, a novel reconstructive VLA model with implicit grounding, and a large-scale pretraining dataset to enhance visual attention and manipulation accuracy in robotic agents.
Findings
ReconVLA outperforms existing models in simulation and real-world tasks.
Implicit grounding improves visual attention allocation.
Large-scale pretraining enhances generalization and reconstruction quality.
Abstract
Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions. Instead, visual attention is always dispersed. To guide the visual attention grounding on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer aims to reconstruct the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention, thus effectively leveraging task-specific visual information and conducting precise manipulation. Moreover, we curate a large-scale pretraining dataset…
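The reconstruction step described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a standard noise-prediction (DDPM-style) objective, which the abstract does not specify, and it replaces the diffusion transformer with a hypothetical one-parameter function, `toy_denoiser`. The names `add_noise`, `reconstruction_loss`, `alpha_bar`, and `weight` are all illustrative. The idea it shows: the gaze-region patch is corrupted with noise, a denoiser conditioned on the VLA's visual output tokens predicts that noise, and the prediction error is the training signal that would flow back into the visual tokens.

```python
import math
import random


def add_noise(patch, noise, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(patch, noise)]


def toy_denoiser(noisy_patch, cond_tokens, weight):
    """Hypothetical stand-in for the diffusion transformer (an assumption).

    Predicts the noise from the noisy patch, conditioned on the mean of the
    visual output tokens. A real model would be a learned network.
    """
    cond = sum(cond_tokens) / len(cond_tokens)
    return [weight * (x - cond) for x in noisy_patch]


def reconstruction_loss(gaze_patch, cond_tokens, alpha_bar, weight, rng):
    """Mean squared error between predicted and true noise (assumed objective)."""
    noise = [rng.gauss(0.0, 1.0) for _ in gaze_patch]
    noisy = add_noise(gaze_patch, noise, alpha_bar)
    predicted = toy_denoiser(noisy, cond_tokens, weight)
    return sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(noise)


# Usage with made-up values: a flattened 4-pixel gaze patch and 3 visual tokens.
rng = random.Random(0)
gaze_patch = [0.2, 0.8, 0.5, 0.1]
cond_tokens = [0.3, 0.4, 0.5]
loss = reconstruction_loss(gaze_patch, cond_tokens, alpha_bar=0.5, weight=1.0, rng=rng)
print(f"reconstruction loss: {loss:.4f}")
```

Because the loss depends on `cond_tokens`, minimizing it in a trainable setting would push the visual outputs to carry information about the gaze region, which is the sense in which the grounding is implicit.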
