Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use

Ramit Pahwa; Apoorva Beedu; Parivesh Priye; Rutu Gandhi; Saloni Takawale; Aruna Baijal; Zengli Yang

arXiv:2604.22821·cs.SD·April 29, 2026

TL;DR

Audio2Tool is a dataset and benchmark of approximately 30,000 spoken queries for evaluating how well SpeechLMs handle diverse, complex, and noisy tool-calling tasks across three domains: Smart Car, Smart Home, and Wearables.

Contribution

The paper contributes a large-scale, multi-domain dataset with a multi-tier complexity hierarchy and realistic noise, intended for evaluating and improving speech tool use in SpeechLMs.

Findings

01

State-of-the-art models perform well on simple commands.

02

Performance drops significantly on complex and multi-intent queries and under noisy conditions.

Abstract

Voice assistants increasingly rely on Speech Language Models (SpeechLMs) to interpret spoken queries and execute complex tasks, yet existing benchmarks lack the domain breadth, acoustic diversity, and compositional reasoning complexity needed to evaluate tool-calling performance. We introduce Audio2Tool, a large-scale dataset comprising approximately 30,000 queries designed to assess tool-calling capabilities of SpeechLMs across three primary domains: Smart Car, Smart Home, and Wearables. Our benchmark features a multi-tier complexity hierarchy, ranging from simple direct commands to complex multi-intent queries and needle-in-a-haystack extraction, designed to isolate distinct failure modes. To ensure realism, we employ zero-shot voice cloning text-to-speech synthesis and diverse noise profiles to simulate in-the-wild conditions. Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines show strong…
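
To make the multi-intent setting concrete, the following is a minimal sketch of how a predicted set of tool calls could be compared against a reference set. It is not the paper's evaluation protocol: the tool names (`set_temperature`, `play_media`), the `name`/`arguments` call schema, and the `score_calls` helper are all hypothetical, chosen only for illustration.

```python
import json
from collections import Counter


def canonical(call):
    """Serialize a tool call so that argument key order does not matter."""
    payload = {"name": call["name"], "arguments": call.get("arguments", {})}
    return json.dumps(payload, sort_keys=True)


def score_calls(predicted, reference):
    """Return (exact_match, precision, recall) over multisets of tool calls.

    Call order is ignored; this is an assumption, not something the
    source specifies.
    """
    pred = Counter(canonical(c) for c in predicted)
    ref = Counter(canonical(c) for c in reference)
    overlap = sum((pred & ref).values())  # calls present in both multisets
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return pred == ref, precision, recall


# Hypothetical two-intent Smart Car query: set the temperature and play music.
reference = [
    {"name": "set_temperature", "arguments": {"zone": "driver", "celsius": 21}},
    {"name": "play_media", "arguments": {"title": "jazz playlist"}},
]
# A model that recovers only the first intent.
predicted = [
    {"name": "set_temperature", "arguments": {"celsius": 21, "zone": "driver"}},
]

exact, precision, recall = score_calls(predicted, reference)
```

In this example the single predicted call is correct but the second intent is missed, so precision is 1.0 while recall is 0.5 and the exact-match flag is false. Scoring recall separately from exact match is one way to tell a partially handled multi-intent query apart from a complete failure.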

Code & Models

Repositories

https://audio2tool.github.io