GRAM: Spatial general-purpose audio representation models for real-world applications
Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden

TL;DR
GRAM is a spatial audio model that learns from multi-channel data. It performs well on real-world acoustic tasks such as sound localization and outperforms existing models while using less training data.
Contribution
The paper introduces GRAM, a multi-channel masked autoencoder for spatial audio, and provides standardized benchmarks demonstrating its superior performance in real-world environments.
Findings
Outperforms state-of-the-art models on spatial audio benchmarks.
Achieves high localization accuracy in simulated environments.
Generalizes effectively to real-world recordings.
Abstract
Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise is limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluated GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as recordings of real-world environments and release these two complementary…
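To make the phrase "multi-channel masked autoencoder" concrete, here is a minimal sketch of the input-masking step such a model typically relies on. It is not GRAM's implementation: the function name `random_patch_mask`, the two-channel input, the 16x16 patch size, and the 0.75 mask ratio are all illustrative assumptions that do not come from the abstract. The sketch splits a (channels, frequency, time) spectrogram into patches that keep all channels together, hides a random subset, and returns only the visible patches that an encoder would see.

```python
import numpy as np

def random_patch_mask(spec, patch=(16, 16), mask_ratio=0.75, seed=0):
    """Patchify a (channels, freq, time) spectrogram and hide a random subset.

    Hypothetical helper for illustration; patch size and mask ratio are
    assumptions, not values reported for GRAM.
    """
    c, f, t = spec.shape
    pf, pt = patch
    if f % pf or t % pt:
        raise ValueError("freq and time must be divisible by the patch size")
    nf, nt = f // pf, t // pt

    # (c, nf, pf, nt, pt) -> (nf * nt, c * pf * pt):
    # each row is one time-frequency patch holding every channel,
    # so inter-channel (spatial) cues stay inside a single token.
    patches = (
        spec.reshape(c, nf, pf, nt, pt)
        .transpose(1, 3, 0, 2, 4)
        .reshape(nf * nt, c * pf * pt)
    )

    rng = np.random.default_rng(seed)
    n_keep = int(round(patches.shape[0] * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(patches.shape[0])[:n_keep])

    # mask[i] is True where patch i is hidden from the encoder.
    mask = np.ones(patches.shape[0], dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], mask, keep_idx

# Example: an assumed 2-channel spectrogram with 64 frequency bins and 128 frames.
spec = np.random.default_rng(1).standard_normal((2, 64, 128))
visible, mask, keep_idx = random_patch_mask(spec)
```

In a masked autoencoder, the encoder processes only `visible`, and a decoder is trained to reconstruct the patches where `mask` is True. Keeping all channels in one patch is one possible design choice; per-channel patching would be an equally plausible alternative.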
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
