TL;DR
MultiVox is a new benchmark designed to evaluate how well voice assistants understand and integrate multimodal cues, including speech paralinguistics and visual information, for more context-aware responses.
Contribution
It introduces the first comprehensive benchmark for assessing multimodal understanding in voice assistants, focusing on paralinguistic speech features and visual cues.
Findings
Current models underperform compared to humans in multimodal understanding.
MultiVox includes 1000 annotated dialogues with diverse speech and visual cues.
Evaluation reveals significant gaps in models' ability to generate contextually grounded responses.
Abstract
The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
