MAEB: Massive Audio Embedding Benchmark

Adnan El Assadi; Isaac Chung; Chenghao Xiao; Roman Solomatin; Animesh Jha; Rahul Chand; Silky Singh; Kaitlyn Wang; Ali Sartaz Khan; Marc Moussa Nasser; Sufen Fong; Pengfei He; Alan Xiao; Ayush Sunil Munot; Aditya Shrivastava; Artem Gazizov; Niklas Muennighoff; and Kenneth Enevoldsen

arXiv:2602.16008 · cs.SD · February 19, 2026

TL;DR

MAEB introduces a large-scale benchmark for audio embeddings across diverse tasks and languages. It shows that model strengths vary by domain and that clustering and cross-modal understanding remain challenging, with implications for audio large language models.

Contribution

The paper presents MAEB, a new large-scale, diverse audio benchmark derived from MAEB+ that enables unified evaluation of audio models across multiple tasks and languages.

Findings

01 No single model dominates all tasks

02 Contrastive audio-text models excel at environmental sound classification but score near random on multilingual speech tasks

03 Speech-pretrained models perform better on linguistic tasks

Abstract

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing