Robust Skills, Brittle Grounding: Diagnosing Restricted Generalization in Vision-Language Action Policies via Multi-Object Picking

David Emukpere; Romain Deffayet; Jean-Michel Renders

arXiv:2602.24143 · cs.RO · March 2, 2026

Open Access

TL;DR

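This paper investigates whether vision-language action (VLA) policies genuinely ground language instructions in the objects they name or instead rely on object-location correlations seen during training. It finds that execution of the manipulation primitive stays substantially more reliable than instruction-conditioned task success as object placement becomes more variable.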

Contribution

The study introduces controlled stress tests for VLA policies based on multi-object picking, shows that manipulation skill and instruction grounding are decoupled, and proposes augmenting benchmarks with task ladders to diagnose instruction-grounded generalization.

Findings

01. Manipulation primitives are more reliable than instruction-conditioned success in complex settings.

02. VLA policies rely on object-location correlations that do not transfer beyond training.

03. Augmenting benchmarks with task ladders improves diagnosis of instruction-grounded generalization.
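
The gap between the first two findings can be made concrete by scoring each episode twice: once for whether the pick primitive executed at all, and once for whether the instructed object was the one picked. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the `Episode` record, its field names, and the `success_rates` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Episode:
    """Hypothetical log record for one picking rollout."""
    instructed: str          # object named in the language instruction
    picked: Optional[str]    # object actually grasped, or None if no grasp succeeded


def success_rates(episodes: Sequence[Episode]) -> Tuple[float, float]:
    """Return (primitive_success, instruction_success) as fractions of episodes.

    primitive_success counts any successful grasp, regardless of target.
    instruction_success counts only grasps of the instructed object.
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    n = len(episodes)
    primitive = sum(e.picked is not None for e in episodes) / n
    grounded = sum(e.picked == e.instructed for e in episodes) / n
    return primitive, grounded


episodes = [
    Episode(instructed="cube", picked="cube"),   # correct object
    Episode(instructed="cube", picked="ball"),   # skill succeeded, grounding failed
    Episode(instructed="ball", picked=None),     # grasp failed
    Episode(instructed="ball", picked="ball"),   # correct object
]
primitive, grounded = success_rates(episodes)
grounding_gap = primitive - grounded
```

A large positive `grounding_gap` would indicate that the policy can pick objects but often picks the wrong one, which is the pattern the findings describe.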

Abstract

Vision-language action (VLA) policies often report strong manipulation benchmark performance with relatively few demonstrations, but it remains unclear whether this reflects robust language-to-object grounding or reliance on object-location correlations that do not transfer beyond the training distribution. We present a controlled multi-object picking study that progressively increases object placement variability up to full workspace randomization and evaluates held-out object-location pairings that break familiar associations without increasing spatial difficulty. Across these stress tests and data scaling, we find that for representative VLA policies, including SmolVLA and $\pi_{0.5}$, execution of the manipulation primitive remains substantially more reliable than instruction-conditioned task success in harder regimes, suggesting that manipulation skill acquisition is decoupled…
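
One way to picture the held-out object-location pairings described in the abstract is a split in which every object and every location still appears in training, but certain combinations never do. The sketch below is an illustrative assumption, not the paper's protocol: the `split_pairings` helper, the cyclic-diagonal rule, and the object and location names are all hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_pairings(
    objects: Sequence[str], locations: Sequence[str], shift: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Hold out one location per object along a cyclic diagonal.

    With 2 <= len(objects) <= len(locations), every object and every
    location remains present in the training pairs, so the held-out
    pairs are new combinations rather than new places or new objects.
    """
    if not 2 <= len(objects) <= len(locations):
        raise ValueError("need 2 <= len(objects) <= len(locations)")
    held_out = {
        (obj, locations[(i + shift) % len(locations)])
        for i, obj in enumerate(objects)
    }
    train = [p for p in product(objects, locations) if p not in held_out]
    return train, sorted(held_out)


train_pairs, held_out_pairs = split_pairings(
    ["cube", "ball", "cup"], ["left", "center", "right"]
)
```

Because each held-out location is still visited during training with other objects, a drop in success on `held_out_pairs` would point to broken object-location associations rather than to harder reaching.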

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics