GRAM: Spatial general-purpose audio representations for real-world environments

Goksenin Yuksel; Marcel van Gerven; Kiki van der Heijden

arXiv:2602.03307·cs.SD·February 5, 2026

Open Access

TL;DR

GRAM is a spatial audio model that learns representations from multi-channel recordings. It performs well in real-world environments and on tasks such as sound localization, and it outperforms existing models while using less training data.

Contribution

The paper introduces GRAM, a multi-channel masked autoencoder for spatial audio, and provides standardized benchmarks demonstrating its superior performance in real-world acoustic tasks.
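
This page does not describe how GRAM's multi-channel masked autoencoder selects what to mask. As a rough illustration of the general masked-autoencoder idea only, the sketch below splits the patch grid of a multi-channel spectrogram into a visible set (what an encoder would see) and a masked set (what a decoder would be asked to reconstruct). The function name `random_patch_mask`, the patch-grid sizes, the 0.75 mask ratio, and the choice to mask each channel independently are all assumptions made for illustration, not details taken from the paper.

```python
import random


def random_patch_mask(n_channels, n_freq_patches, n_time_patches,
                      mask_ratio=0.75, seed=0):
    """Split a multi-channel patch grid into visible and masked tokens.

    Hypothetical helper: every name and default here is an illustrative
    assumption, not GRAM's actual masking scheme.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")

    # One token per (channel, frequency patch, time patch) position.
    tokens = [(c, f, t)
              for c in range(n_channels)
              for f in range(n_freq_patches)
              for t in range(n_time_patches)]

    # Shuffle with a seeded generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(tokens)

    n_masked = int(len(tokens) * mask_ratio)
    masked = sorted(tokens[:n_masked])    # targets for reconstruction
    visible = sorted(tokens[n_masked:])   # input to the encoder
    return visible, masked


# Example: a two-channel (e.g. binaural) input with an 8 x 16 patch grid.
visible, masked = random_patch_mask(n_channels=2,
                                    n_freq_patches=8,
                                    n_time_patches=16)
print(len(visible), len(masked))  # prints "64 192"
```

Because only the visible tokens are passed to the encoder in this kind of scheme, a high mask ratio reduces the encoder's input length, which is one common reason masked autoencoders are considered efficient to train.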

Findings

01. Outperforms state-of-the-art models on NatHEAR and HEAR benchmarks.

02. Achieves high localization accuracy in simulated environments.

03. Generalizes effectively to real-world recordings in RealSELD.

Abstract

Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise is limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluated GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as recordings of real-world environments and release these two complementary…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation