Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

Guan-Ting Lin; Chen Chen; Zhehuai Chen; Hung-yi Lee

arXiv:2604.04847 · eess.AS · April 7, 2026

TL;DR

This paper introduces FDB-v3, a benchmark built from real human audio that evaluates spoken language models on disfluent speech and multi-step tool use, and reports how six current model configurations trade off accuracy, latency, and turn-taking.

Contribution

The work presents a new dataset of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains, along with an evaluation of six model configurations on accuracy, latency, and turn-taking.

Findings

01

GPT-Realtime achieves the highest Pass@1 (0.600) and the best interruption avoidance (13.5%).

02

Gemini Live 3.1 has the fastest response latency (4.25 s) but the lowest turn-take rate (78.0%).

03

The Cascaded pipeline has a perfect turn-take rate but the highest latency (10.12 s).

Abstract

We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations across accuracy, latency, and turn-taking dimensions: GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper → GPT-4o → TTS). GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25 s) but the lowest turn-take rate (78.0%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12 s). Across all systems, self-correction…
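The abstract reports three kinds of numbers per system: Pass@1, latency, and turn-taking rates. As a rough illustration of how such aggregates could be computed from per-trial logs, here is a minimal Python sketch. It is not the authors' evaluation code: the `TrialResult` record, its field names, and the exact metric definitions (for example, treating the turn-take rate as the fraction of trials in which the model responds at all) are assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class TrialResult:
    """One benchmark trial. All field names are hypothetical."""
    passed: bool                 # first attempt completed the tool-use task correctly
    took_turn: bool              # model produced a response after the user finished
    interrupted_user: bool       # model started speaking while the user was still talking
    latency_s: Optional[float]   # response latency in seconds, None if no response


def summarize(trials: List[TrialResult]) -> Dict[str, Optional[float]]:
    """Aggregate per-trial outcomes into benchmark-style metrics.

    Assumed definitions (not taken from the paper):
      pass_at_1         = passed trials / all trials
      turn_take_rate    = trials where the model responded / all trials
      interruption_rate = trials where the model talked over the user / all trials
      mean_latency_s    = mean latency over trials that have a response
    """
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    latencies = [
        t.latency_s for t in trials
        if t.took_turn and t.latency_s is not None
    ]
    return {
        "pass_at_1": sum(t.passed for t in trials) / n,
        "turn_take_rate": sum(t.took_turn for t in trials) / n,
        "interruption_rate": sum(t.interrupted_user for t in trials) / n,
        "mean_latency_s": mean(latencies) if latencies else None,
    }
```

Calling `summarize` on a list of `TrialResult` records returns one dictionary per system, which makes the trade-off described in the abstract (a system can lead on latency while trailing on turn-take rate) easy to tabulate side by side.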
