Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language

Guangfu Hao; Haojie Wen; Liangxuan Guo; Yang Chen; Yanchao Bi; Shan Yu

arXiv:2505.22146 · cs.CV · August 22, 2025

TL;DR

This paper introduces a low-dimensional attribute alignment framework for flexible tool selection, combining visual and linguistic cues to mimic human cognition with high accuracy and efficiency.

Contribution

It presents a novel, parameter-efficient approach using attribute representations to improve multimodal tool selection, outperforming smaller models and approaching GPT-4's performance.

Findings

01 Achieves 74% accuracy in tool selection tasks.

02 Outperforms direct tool matching (20%) and smaller multimodal models (21%-58%).

03 Validates alignment with human decision-making patterns.

Abstract

Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework that uses low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images, while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%),…
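
The selection step described in the abstract can be illustrated with a minimal sketch: each tool and each task is represented as a 13-dimensional attribute vector, and the tool whose vector best matches the task's required attributes is chosen. The abstract does not state the matching rule, so cosine similarity is an assumption here. The tool names, attribute values, and the functions `cosine_similarity` and `select_tool` are hypothetical; in the framework the vectors would come from the visual encoder and the fine-tuned language model rather than being written by hand.

```python
import math

NUM_ATTRIBUTES = 13  # attribute count stated in the abstract


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def select_tool(tool_attributes, required_attributes):
    """Return (name, score) of the tool closest to the required attributes."""
    if len(required_attributes) != NUM_ATTRIBUTES:
        raise ValueError("task vector must have 13 attributes")
    best_name, best_score = None, -math.inf
    for name, vector in tool_attributes.items():
        if len(vector) != NUM_ATTRIBUTES:
            raise ValueError(f"tool {name!r} must have 13 attributes")
        score = cosine_similarity(vector, required_attributes)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Made-up attribute vectors standing in for visual-encoder outputs.
padding = [0.0] * (NUM_ATTRIBUTES - 3)
tools = {
    "hammer": [0.9, 0.8, 0.1] + padding,
    "knife": [0.2, 0.3, 0.9] + padding,
}

# Made-up vector standing in for the language model's required attributes.
task = [0.8, 0.9, 0.0] + padding

chosen, score = select_tool(tools, task)
print(chosen, round(score, 3))
```

Matching in a shared attribute space, rather than mapping a task description directly to a tool label, is what lets the visual and language sides be trained separately; any distance or similarity measure over the 13 attributes could be substituted for the cosine used in this sketch.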

Taxonomy

Topics: Robot Manipulation and Learning · Action Observation and Synchronization · Interactive and Immersive Displays

Methods: LLaMA