GRAM: Spatial general-purpose audio representation models for real-world applications

Goksenin Yuksel; Marcel van Gerven; Kiki van der Heijden

arXiv:2506.00934 · cs.SD · February 5, 2026

TL;DR

GRAM is a spatial audio model that learns from multi-channel data. It performs well on real-world acoustic tasks such as sound localization and outperforms existing models while using less training data.

Contribution

The paper introduces GRAM, a multi-channel masked autoencoder for spatial audio, and provides standardized benchmarks demonstrating its superior performance in real-world environments.

Findings

01. Outperforms state-of-the-art models on spatial audio benchmarks.
02. Achieves high localization accuracy in simulated environments.
03. Generalizes effectively to real-world recordings.

Abstract

Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise is limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluated GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as recordings of real-world environments and release these two complementary…
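The masking step behind a multi-channel masked autoencoder can be illustrated with a minimal sketch. It is not the authors' implementation: the two-channel input, the 16x16 patch size, the 75% mask ratio, the choice to keep all channels of a patch in one token, and the names `random_patch_mask` and `masked_mse` are all assumptions made for illustration.

```python
import numpy as np


def random_patch_mask(spec, patch=(16, 16), mask_ratio=0.75, seed=0):
    """Cut a (channels, freq, time) spectrogram into patches and hide a random subset.

    Hypothetical helper: each token holds all channels of one freq-time patch.
    Returns (visible_patches, mask, all_patches), where mask is True for hidden patches.
    """
    c, f, t = spec.shape
    pf, pt = patch
    if f % pf or t % pt:
        raise ValueError("spectrogram size must be divisible by the patch size")
    nf, nt = f // pf, t // pt
    # (c, nf, pf, nt, pt) -> (nf, nt, c, pf, pt) -> (num_patches, c * pf * pt)
    patches = (
        spec.reshape(c, nf, pf, nt, pt)
        .transpose(1, 3, 0, 2, 4)
        .reshape(nf * nt, c * pf * pt)
    )
    n = nf * nt
    n_masked = int(round(mask_ratio * n))
    rng = np.random.default_rng(seed)
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_masked]] = True
    return patches[~mask], mask, patches


def masked_mse(pred, target, mask):
    """Mean squared error computed only over the hidden patches."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


# Assumed toy input: 2 channels, 128 frequency bins, 64 time frames.
spec = np.random.default_rng(1).standard_normal((2, 128, 64))
visible, mask, patches = random_patch_mask(spec)
```

In this sketch an encoder would see only `visible`, and a decoder would be trained to reconstruct the rows of `patches` selected by `mask`, scored with `masked_mse`. Keeping the channels together in each token is one way to let a model relate inter-channel differences to a source position; whether GRAM tokenizes its input this way is not stated in the abstract above.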

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Mamba (Linear-Time Sequence Modeling with Selective State Spaces)