Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language
Guangfu Hao, Haojie Wen, Liangxuan Guo, Yang Chen, Yanchao Bi, Shan Yu

TL;DR
This paper introduces a low-dimensional attribute alignment framework for flexible tool selection: attributes extracted from tool images are matched against attributes derived from natural-language task descriptions, reaching 74% accuracy with a parameter-efficient design.
Contribution
It presents a novel, parameter-efficient approach using attribute representations to improve multimodal tool selection, outperforming smaller models and approaching GPT-4's performance.
Findings
Achieves 74% accuracy in tool selection tasks.
Outperforms direct tool matching (20%) and smaller multimodal models (21%-58%).
Validates alignment with human decision-making patterns.
Abstract
Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%),…
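The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names, the tool vectors, and the use of cosine similarity as the matching score are all assumptions made for illustration, and the hard-coded vectors stand in for the outputs of the visual encoder and the language model.

```python
import math

# Hypothetical attribute names (the paper uses 13 attributes; these four are made up).
ATTRIBUTES = ["sharpness", "hardness", "graspability", "elongation"]

# Made-up attribute vectors standing in for what a visual encoder might predict.
TOOL_ATTRIBUTES = {
    "knife": [0.9, 0.8, 0.7, 0.6],
    "hammer": [0.1, 0.9, 0.8, 0.5],
    "sponge": [0.0, 0.1, 0.9, 0.1],
}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tool(required, tool_attributes):
    """Return the tool whose attribute vector best matches the required attributes."""
    return max(
        tool_attributes,
        key=lambda name: cosine_similarity(required, tool_attributes[name]),
    )


# Made-up "required attributes" standing in for a language model's reading
# of a task description such as "slice a loaf of bread".
required = [0.95, 0.7, 0.6, 0.5]
print(select_tool(required, TOOL_ATTRIBUTES))
```

The point of the sketch is only that both modalities are compared in one small shared attribute space, so selection reduces to a nearest-match lookup over a handful of numbers per tool.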
Taxonomy
Topics: Robot Manipulation and Learning · Action Observation and Synchronization · Interactive and Immersive Displays
Methods: LLaMA
