Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)
Cyril Allauzen, Tom Bagby, Georg Heigold, Ehsan Variani, Ke Wu

TL;DR
This paper evaluates leading audio-native LLMs on the Massive Sound Embedding Benchmark, highlighting persistent modality gaps and discussing architecture choices based on use-case needs.
Contribution
It provides a comprehensive empirical comparison of top LLMs on MSEB, analyzing performance, robustness, and architectural implications for audio-text tasks.
Findings
A significant modality gap remains in both performance and robustness.
No conclusive evidence favors a single optimal modeling approach.
Architecture choice depends on latency, cost, and reasoning requirements.
Abstract
The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single multimodal backbone may replace complex, task-specific pipelines. This paper provides a rigorous empirical evaluation of leading LLMs, including members of the Gemini and GPT families, across the eight core MSEB capabilities to assess their efficacy and audio-text parity. Our results indicate that while a significant modality gap persists regarding performance and robustness, the empirical evidence for an "optimal" modeling approach remains inconclusive. Ultimately, the choice between audio-native and cascaded architectures depends heavily on specific use-case requirements and the underlying assumptions…
