Massive Sound Embedding Benchmark (MSEB)

Georg Heigold; Ehsan Variani; Tom Bagby; Cyril Allauzen; Ji Ma; Shankar Kumar; Michael Riley

arXiv:2602.07143 · cs.SD · February 10, 2026

TL;DR

The Massive Sound Embedding Benchmark (MSEB) provides a comprehensive, extensible framework for evaluating auditory capabilities in multimodal systems through diverse tasks and datasets, including a new large-scale voice dataset.

Contribution

MSEB introduces a new benchmark framework, with eight core tasks in its first release and accompanying datasets, for assessing sound embeddings in multimodal AI systems. It also includes a novel large-scale voice dataset.

Findings

1. Initial experiments reveal significant performance headroom.

2. Diverse datasets enable comprehensive evaluation.

3. Framework encourages community contributions.

Abstract

Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful 'embedding' - be it a single vector, a sequence of continuous or discrete representations, or another structured form - which then serves as the basis for generating the task's final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future,…
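
The audio-to-embedding-to-response pattern described in the abstract can be illustrated with a toy retrieval task. This is only a sketch under stated assumptions: the encoder (`embed`), the synthetic tone data (`tone`), and the recall@1 scoring are hypothetical stand-ins chosen for illustration, and none of these names come from the MSEB framework or its API.

```python
import numpy as np


def embed(waveform: np.ndarray, frame: int = 256) -> np.ndarray:
    """Hypothetical encoder (not MSEB's): mean-pooled log-magnitude spectrum.

    Maps a raw waveform to a single L2-normalized vector.
    """
    n_frames = len(waveform) // frame
    frames = waveform[: n_frames * frame].reshape(n_frames, frame)
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
    vector = np.log1p(spectrum).mean(axis=0)
    return vector / (np.linalg.norm(vector) + 1e-12)


def retrieve(query: np.ndarray, index: np.ndarray) -> int:
    """Return the row of `index` with the highest cosine similarity to `query`.

    Rows of `index` and `query` are assumed to be L2-normalized already.
    """
    return int(np.argmax(index @ query))


def tone(freq_hz: float, sr: int = 8000, seconds: float = 0.5,
         noise: float = 0.0, seed: int = 0) -> np.ndarray:
    """Synthetic stand-in for an audio clip: a sine wave plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * seconds)) / sr
    return np.sin(2 * np.pi * freq_hz * t) + noise * rng.standard_normal(t.size)


# Build an index of clean clips, then query it with noisy versions.
freqs = [300.0, 900.0, 2000.0]
index = np.stack([embed(tone(f)) for f in freqs])
queries = [embed(tone(f, noise=0.05, seed=i + 1)) for i, f in enumerate(freqs)]

# The task response (retrieved item) is derived only from the embeddings.
hits = [retrieve(q, index) == i for i, q in enumerate(queries)]
recall_at_1 = float(np.mean(hits))
print(f"recall@1 = {recall_at_1:.2f}")
```

The point of the sketch is the separation of concerns: the encoder produces the embedding, and the task logic consumes only that embedding, so a different encoder could be swapped in without changing the scoring code.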

Videos

Massive Sound Embedding Benchmark (MSEB) · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Emotion and Mood Recognition