Massive Sound Embedding Benchmark (MSEB)
Georg Heigold, Ehsan Variani, Tom Bagby, Cyril Allauzen, Ji Ma, Shankar Kumar, Michael Riley

TL;DR
The Massive Sound Embedding Benchmark (MSEB) provides a comprehensive, extensible framework for evaluating auditory capabilities in multimodal systems through diverse tasks and datasets, including a new large-scale voice dataset.
Contribution
MSEB introduces a new benchmark framework with multiple core tasks and datasets for assessing sound embeddings in multimodal AI systems, including a novel large-scale voice dataset.
Findings
Initial experiments reveal significant performance headroom.
The diverse datasets enable comprehensive evaluation.
The framework is extensible and encourages community contributions.
Abstract
Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful 'embedding' - be it a single vector, a sequence of continuous or discrete representations, or another structured form - which then serves as the basis for generating the task's final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future,…
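The abstract's central idea, that each task first maps raw audio to an embedding and then derives its response from that embedding, can be illustrated with a minimal sketch. This is not MSEB's actual API, which this summary does not show. The names `toy_encoder`, `cosine`, and `retrieve` are hypothetical, the encoder is a deliberately crude stand-in that returns a single vector, and retrieval by cosine similarity is just one assumed way a task could consume the embedding.

```python
import math
from typing import Callable, Sequence

Vector = list[float]
# Hypothetical interface: raw audio samples in, single-vector embedding out.
Encoder = Callable[[Sequence[float]], Vector]


def toy_encoder(samples: Sequence[float]) -> Vector:
    """Crude stand-in encoder (an assumption, not an MSEB model).

    Returns [mean, mean energy, zero-crossing rate] of the signal.
    """
    n = len(samples)
    mean = sum(samples) / n
    energy = sum(s * s for s in samples) / n
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return [mean, energy, crossings / max(n - 1, 1)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(
    encoder: Encoder,
    query: Sequence[float],
    corpus: dict[str, Sequence[float]],
) -> list[str]:
    """Rank corpus items by similarity of their embeddings to the query's."""
    q = encoder(query)
    scores = {name: cosine(q, encoder(audio)) for name, audio in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Two synthetic tones and a quieter copy of the low tone as the query.
    low = [math.sin(2 * math.pi * 2 * t / 100) for t in range(100)]
    high = [math.sin(2 * math.pi * 20 * t / 100) for t in range(100)]
    query = [0.9 * s for s in low]
    print(retrieve(toy_encoder, query, {"low": low, "high": high}))
```

Because the task logic only sees the encoder through its input and output types, a different encoder can be swapped in without changing `retrieve`. That separation is the property the sketch is meant to show.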
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Emotion and Mood Recognition
