MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions

Ramaneswaran Selvakumar; Ashish Seth; Nishit Anand; Utkarsh Tyagi; Sonal Kumar; Sreyan Ghosh; Dinesh Manocha

arXiv:2507.10859 · cs.MM · September 29, 2025

TL;DR

MultiVox is a new benchmark designed to evaluate how well voice assistants understand and integrate multimodal cues, including speech paralinguistics and visual information, for more context-aware responses.

Contribution

The paper introduces what the authors describe as the first comprehensive benchmark for assessing multimodal understanding in voice assistants, focusing on paralinguistic speech features and visual cues.

Findings

01 Current models underperform compared to humans in multimodal understanding.

02 MultiVox includes 1000 annotated dialogues with diverse speech and visual cues.

03 Evaluation reveals significant gaps in models' ability to generate contextually grounded responses.

Abstract

The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to…
