From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Md Tahmid Rahman Laskar; Xue-Yong Fu; Seyyed Saeed Sarfjoo; Quinten McNamara; Jonas Robertson; Shashi Bhushan TN

arXiv:2605.15104 · cs.CL · May 21, 2026


TL;DR

This paper introduces a framework to convert text-based tool-calling benchmarks into audio-based evaluations for voice agents, enabling reliable assessment of speech tool use without re-annotating datasets.

Contribution

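The paper presents a dataset-agnostic, reproducible method that uses text-to-speech, speaker variation, and environmental noise to convert text tool-calling benchmarks into audio evaluations of omni-modal models, while preserving the original tool schemas and gold labels.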
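The conversion idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `Instance` fields, the `synthesize` callable standing in for a text-to-speech engine, and the SNR-based mixing are all assumptions made for illustration. The sketch speaks the user turn with a randomly chosen voice, mixes in noise at a target signal-to-noise ratio, and copies the tool schema and gold label through unchanged.

```python
import math
import random
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Instance:
    """One benchmark item. Field names are hypothetical, not from the paper."""
    user_text: str                 # user turn to be spoken
    tools: tuple                   # tool schema, carried over unchanged
    gold_call: str                 # gold label, carried over unchanged
    audio: Optional[tuple] = None  # waveform samples; None for the text version


def rms(x: Sequence[float]) -> float:
    """Root-mean-square amplitude of a non-empty sample sequence."""
    return math.sqrt(sum(s * s for s in x) / len(x))


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Add noise to speech, scaled so the speech-to-noise ratio equals snr_db.

    Assumes both signals are non-empty and the noise is not all zeros.
    """
    reps = -(-len(speech) // len(noise))      # ceiling division
    noise = (noise * reps)[:len(speech)]      # tile, then trim to speech length
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def to_audio_instance(
    inst: Instance,
    synthesize: Callable[[str, str], List[float]],  # hypothetical TTS: (text, voice) -> samples
    voices: Sequence[str],
    noise: List[float],
    snr_db: float,
    rng: random.Random,
) -> Instance:
    """Build the audio twin of a text instance without touching its annotations."""
    voice = rng.choice(list(voices))          # speaker variation
    speech = synthesize(inst.user_text, voice)
    noisy = mix_at_snr(speech, noise, snr_db)
    return replace(inst, audio=tuple(noisy))  # tools and gold_call are preserved
```

Passing a seeded `random.Random` keeps the voice selection reproducible, so the same text instance always maps to the same audio instance under this sketch.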

Findings

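01. Performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9).

02. The degradation from text to voice is mainly due to misunderstood argument values in speech.

03. Open-source judges with 8B+ parameters align well with human preferences.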

Abstract

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for…
