Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs
Angelos Mavrogiannis, Dehao Yuan, Yiannis Aloimonos

TL;DR
This paper introduces a perception-action API that combines Vision Language Models (VLMs) and Large Language Models (LLMs) with robot control functions to actively identify non-visual object attributes, such as weight, and reports that it outperforms vanilla VLMs in offline tests on the Odd-One-Out dataset.
Contribution
The novel perception-action API enables active perception of non-visual attributes using LLMs and VLMs, advancing grounding capabilities beyond visual features.
Findings
Outperforms vanilla VLMs at detecting attributes such as relative object location and size in offline tests on the Odd-One-Out dataset
Effective in household scenes and real robot demonstrations
Demonstrates that active perception improves attribute grounding
Abstract
There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and…
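The abstract describes an LLM that, given the API and a natural language query, writes a program mixing perception calls with robot actions. The following is a minimal sketch of that idea, not the authors' implementation: the class `SimulatedPerceptionActionAPI`, its methods (`locate_objects`, `pick_up`, `measure_weight`, `put_down`), and the program `find_heaviest` are all hypothetical names, and the VLM and robot are replaced by a simulated scene so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    true_weight_kg: float  # not visible in an image; only revealed by acting


class SimulatedPerceptionActionAPI:
    """Hypothetical stand-in for a perception-action API (all names are assumptions)."""

    def __init__(self, scene):
        self._scene = {obj.name: obj for obj in scene}
        self.action_log = []  # records every robot action that was issued

    # Perception function: in a real system this would call a VLM on the image.
    def locate_objects(self, image):
        return list(self._scene)

    # Action functions: in a real system these would command a robot arm.
    def pick_up(self, name):
        self.action_log.append(("pick_up", name))

    def measure_weight(self, name):
        # Assumed to read a wrist force sensor while the object is held.
        self.action_log.append(("measure_weight", name))
        return self._scene[name].true_weight_kg

    def put_down(self, name):
        self.action_log.append(("put_down", name))


def find_heaviest(api, image):
    """Example of a program an LLM might generate for 'Which object is the heaviest?'."""
    weights = {}
    for name in api.locate_objects(image):  # visual reasoning: what is in the scene
        api.pick_up(name)                   # active perception: interact to sense
        weights[name] = api.measure_weight(name)
        api.put_down(name)
    return max(weights, key=weights.get)


api = SimulatedPerceptionActionAPI(
    [SceneObject("mug", 0.3), SceneObject("book", 0.8)]
)
print(find_heaviest(api, image=None))  # prints "book"
```

The point of the sketch is the division of labor: the image alone cannot answer the query, so the generated program uses perception to decide what to act on and actions to obtain the missing measurement.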
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks
