Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
Federico Tavella, Amber Drinkwater, Angelo Cangelosi

TL;DR
This paper evaluates the robustness of Vision-Language Models in robotic scene understanding under domain shifts, revealing significant performance degradation and evaluation vulnerabilities.
Contribution
It introduces a controlled physical domain shift in robotic scene understanding and benchmarks VLMs, exposing their limitations and vulnerabilities in real-world robotic applications.
Findings
VLM performance drops on 3D-printed objects despite structural similarity.
Standard metrics may fail to detect domain shifts, and some may even reward incorrect captions.
VLMs describe common objects well but struggle with differences in texture, material, or colour.
Abstract
Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely…
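To illustrate how a caption metric can reward an incorrect caption, here is a minimal sketch using a toy unigram-overlap F1. This is not one of the metrics benchmarked in the paper. The function `unigram_f1` and all three captions are hypothetical, made up purely for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two captions (toy stand-in for a lexical metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both captions.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical captions, not taken from the paper's dataset.
reference = "a red metal wrench on a wooden table"
wrong_object = "a red metal hammer on a wooden table"   # wrong tool, same wording
paraphrase = "spanner made of steel lying on the desk"  # right tool, different wording

score_wrong = unigram_f1(wrong_object, reference)
score_paraphrase = unigram_f1(paraphrase, reference)

print(f"wrong object: {score_wrong:.3f}")
print(f"paraphrase:   {score_paraphrase:.3f}")
```

In this sketch the caption naming the wrong tool scores far higher than the correct paraphrase, because the score depends only on shared surface tokens. Whether and how strongly the metrics studied in the paper show this behaviour is for the paper's own results to establish.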
