Assessing VLM-Driven Semantic-Affordance Inference for Non-Humanoid Robot Morphologies
Jess Jones, Raul Santos-Rodriguez, Sabine Hauert

TL;DR
This paper evaluates the ability of vision-language models to infer affordances for robots with non-humanoid shapes, highlighting their strengths and limitations across diverse object categories and robot forms.
Contribution
It introduces a hybrid dataset combining real and synthetic scenarios and provides an empirical analysis of VLM performance on non-humanoid robotic affordance inference.
Findings
VLMs show promising generalisation to non-humanoid robots.
Performance varies significantly across object categories.
VLMs tend to produce few false positives but many false negatives, especially in novel tool-use scenarios.
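To make the last finding concrete, the sketch below shows how false positive and false negative rates are conventionally computed for a binary "robot can / cannot use this object for this affordance" judgement. It is only an illustration: the function name `error_rates`, the `ErrorRates` container, and the label vectors are hypothetical and are not taken from the paper, its dataset, or its results.

```python
from dataclasses import dataclass
import math


@dataclass
class ErrorRates:
    false_positive_rate: float  # FP / (FP + TN)
    false_negative_rate: float  # FN / (FN + TP)


def error_rates(y_true, y_pred):
    """Compute FPR and FNR from binary labels (1 = affordance holds)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    # A rate is undefined (NaN) when its class has no examples.
    fpr = fp / (fp + tn) if (fp + tn) else math.nan
    fnr = fn / (fn + tp) if (fn + tp) else math.nan
    return ErrorRates(fpr, fnr)


# Made-up labels for illustration only (not data from the paper):
# four valid affordances, four invalid ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
rates = error_rates(y_true, y_pred)
print(rates.false_positive_rate, rates.false_negative_rate)
```

In this toy case the model never asserts an affordance that does not hold, but it misses half of the affordances that do, which is the conservative error pattern the finding describes.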
Abstract
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding human-object interactions, but their application to robotic systems with non-humanoid morphologies remains largely unexplored. This work investigates whether VLMs can effectively infer affordances for robots with fundamentally different embodiments than humans, addressing a critical gap in the deployment of these models for diverse robotic applications. We introduce a novel hybrid dataset that combines annotated real-world robotic affordance-object relations with VLM-generated synthetic scenarios, and perform an empirical analysis of VLM performance across multiple object categories and robot morphologies, revealing significant variations in affordance inference capabilities. Our experiments demonstrate that while VLMs show promising generalisation to non-humanoid robot forms, their performance is…
