RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Gabriele Mattioli; Evelyn Turri; Sara Sarto; Lorenzo Baraldi; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara

arXiv:2604.14951 · cs.CV · April 17, 2026

TL;DR

RaTA-Tool introduces a retrieval-based framework enabling multimodal large language models to select external tools in open-world settings by converting multimodal queries into structured descriptions and matching them with rich tool metadata.

Contribution

The paper proposes a retrieval-based approach to open-world multimodal tool selection that improves tool-selection performance and lets new tools be added without retraining the model.

Findings

1. Significantly improves tool-selection accuracy in multimodal, open-world scenarios.
2. Supports adding new tools without retraining the model.
3. Introduces the first dataset for open-world multimodal tool use.

Abstract

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task…
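
The retrieval idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for whatever encoder the paper uses, the tool names and descriptions are invented, and the task description is hard-coded where the paper would have an MLLM generate it from a multimodal query.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a learned text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tool(task_description, tools):
    """Return the name of the tool whose description best matches the task."""
    query = embed(task_description)
    return max(tools, key=lambda name: cosine(query, embed(tools[name])))


# Hypothetical tool registry: names and descriptions are invented for illustration.
tools = {
    "image_captioner": "generate a text caption describing the content of an image",
    "speech_transcriber": "transcribe spoken audio into written text",
}

# Assumption: an MLLM would produce this structured description from the
# user's multimodal query; here it is written by hand.
task = "describe the content of the image in a short caption"
print(select_tool(task, tools))

# Adding a tool only requires a new registry entry; nothing is retrained.
tools["object_detector"] = "detect and localize objects in an image with bounding boxes"
print(select_tool("find objects in the image and return bounding boxes", tools))
```

Because selection is a similarity search over descriptions rather than a classification over fixed tool identifiers, the registry can grow after deployment, which is the extensibility property the summary highlights.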
