Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding

Khoa Vo; Taisei Hanyu; Yuki Ikebe; Trong Thang Pham; Nhat Chung; Minh Nhat Vu; Duy Nguyen Ho Minh; Anh Nguyen; Anthony Gunderman; Chase Rainwater; Ngan Le

arXiv:2512.22519 · cs.RO · April 27, 2026

TL;DR

This paper introduces OBEYED-VLA, a framework that enhances vision-language-action models for robotic manipulation by explicitly disentangling perception and control through object-centric and geometry-aware grounding, improving robustness in cluttered environments.

Contribution

The paper proposes a perception module that grounds multi-view inputs into task-conditioned, object-centric, and geometry-aware observations, so the VLA policy no longer operates directly on raw RGB; this improves robustness and generalization in cluttered settings.

Findings

01. Substantially improves robustness over baselines in cluttered environments.

02. Both semantic and geometric grounding are critical for performance gains.

03. Achieves better target rejection and handling of unseen objects.

Abstract

Recent Vision-Language-Action (VLA) models have made impressive progress toward general-purpose robotic manipulation by post-training large Vision-Language Models (VLMs) for action prediction. Yet most VLAs entangle perception and control in a monolithic pipeline optimized purely for action, which can erode language-conditioned grounding. In our real-world tabletop tests, policies over-grasp when the target is absent, are distracted by clutter, and overfit to background appearance. To address these issues, we propose OBEYED-VLA (OBject-centric and gEometrY groundED VLA), a framework that explicitly disentangles perceptual grounding from action reasoning. Instead of operating directly on raw RGB, OBEYED-VLA augments VLAs with a perception module that grounds multi-view inputs into task-conditioned, object-centric, and geometry-aware observations. This module includes a VLM-based…
