Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment
Maureen de Seyssel, Eeshan Gunesh Dhekane

TL;DR
This paper introduces a comprehensive taxonomy for evaluating speech models, aligning evaluation methods with model capabilities and task requirements to improve assessment consistency and identify gaps.
Contribution
It presents a unified, axis-based taxonomy for classifying speech model evaluations, making it easier to match each model with an appropriate assessment protocol.
Findings
Classifies existing evaluations along three orthogonal axes: the evaluation aspect, the required model capabilities, and the task or protocol requirements.
Maps evaluations to model capabilities and methodological demands.
Identifies gaps in current evaluation coverage, such as prosody and reasoning.
Abstract
Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains disjointed across tasks and model types. Different models excel at distinct aspects of speech processing and thus require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: Which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its…
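The matching idea described in the abstract can be illustrated with a small data structure. The following is a minimal sketch, not the authors' implementation: the class `Evaluation`, the function `applicable_evaluations`, the catalog entries, and all label strings (for example `speech_embeddings` or `human_raters`) are hypothetical names chosen for illustration. It assumes each evaluation can be tagged with one value per axis and that an evaluation is appropriate for a model when the model exposes every required capability and the evaluator can meet every protocol requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """One evaluation, tagged along the three axes (labels are hypothetical)."""
    name: str
    aspect: str                        # axis 1: what is being measured
    required_capabilities: frozenset   # axis 2: what the model must expose
    protocol_requirements: frozenset   # axis 3: what the protocol needs


def applicable_evaluations(evaluations, model_capabilities, available_resources):
    """Return names of evaluations whose requirements are fully satisfied."""
    caps = frozenset(model_capabilities)
    resources = frozenset(available_resources)
    return [
        e.name
        for e in evaluations
        # Subset checks: every requirement must be covered.
        if e.required_capabilities <= caps and e.protocol_requirements <= resources
    ]


# Hypothetical catalog entries, invented for this sketch.
CATALOG = [
    Evaluation(
        "toy_representation_probe",
        "representation quality",
        frozenset({"speech_embeddings"}),
        frozenset({"labeled_audio"}),
    ),
    Evaluation(
        "toy_generation_naturalness",
        "generation quality",
        frozenset({"speech_generation"}),
        frozenset({"human_raters"}),
    ),
    Evaluation(
        "toy_dialogue_turn_taking",
        "interactive dialogue",
        frozenset({"speech_generation", "real_time_processing"}),
        frozenset({"human_raters"}),
    ),
]

if __name__ == "__main__":
    # A model that can generate speech but does not run in real time.
    print(applicable_evaluations(
        CATALOG,
        model_capabilities={"speech_generation"},
        available_resources={"human_raters", "labeled_audio"},
    ))
```

Using set inclusion keeps the axes independent: adding a new capability or protocol requirement changes only the tags on the affected evaluations, not the matching logic.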
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Emotion and Mood Recognition
