Learning Spatially-Aware Language and Audio Embeddings

Bhavika Devnani; Skyler Seto; Zakaria Aldeneh; Alessandro Toso; Elena Menyaylenko; Barry-John Theobald; Jonathan Sheaffer; Miguel Sarabia

arXiv:2409.11369·cs.SD·November 27, 2024

TL;DR

ELSA is a novel spatially aware audio and text embedding model that uses contrastive learning to understand and localize sounds with spatial and semantic context, bridging the gap between non-spatial models and fixed-class localization.

Contribution

The paper introduces ELSA, a multimodal contrastive learning model that captures spatial and semantic sound attributes from large-scale datasets, enabling open-vocabulary and spatially-aware sound understanding.
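As a rough illustration of what multimodal contrastive learning between audio and text can look like, the sketch below computes a symmetric cross-entropy loss over a batch of paired embeddings, in the style commonly used for audio-text and image-text models. The summary above only states that ELSA uses contrastive learning; the function names, the temperature value of 0.07, and the toy embeddings are assumptions made for this example, and the paper's actual objective may differ or include further terms.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy losses.

    The temperature is an assumed, illustrative value.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # logits[i][j] = cosine similarity of audio i and caption j, scaled
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch (hypothetical values): two audio clips and their captions.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]   # caption i aligns with audio i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # captions deliberately mismatched

loss_matched = symmetric_contrastive_loss(audio, text_matched)
loss_swapped = symmetric_contrastive_loss(audio, text_swapped)
print(loss_matched < loss_swapped)
```

The loss is small when each audio embedding is closest to its own caption and large when the pairs are mismatched, which is the pressure that pulls matching audio and text together in the shared embedding space.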

Findings

01 ELSA outperforms the state of the art in semantic retrieval (+2.8% R@1).

02 ELSA reduces 3D source localization error (-11.6° MAE).

03 ELSA supports both non-spatial and spatial audio with open-vocabulary captions.
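The two metrics quoted in the findings can be sketched as follows. R@1 is taken here as the fraction of queries whose top-ranked candidate is the correct match, and the localization error as the mean angle between predicted and true source directions. How the paper computes its MAE is not stated in this summary, so the angular-distance formulation, the azimuth/elevation convention, and all function names and toy values below are assumptions for illustration only.

```python
import math

def recall_at_1(similarity):
    """Fraction of queries whose highest-scoring candidate is the matching one.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        hits += (best == i)
    return hits / len(similarity)

def direction(azimuth_deg, elevation_deg):
    """Unit vector for a direction given as azimuth and elevation in degrees."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def angular_error_deg(pred, true):
    """Angle in degrees between two unit vectors."""
    dot = sum(p * t for p, t in zip(pred, true))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

def mean_angular_error_deg(preds, trues):
    """Mean angular error over (azimuth, elevation) pairs given in degrees."""
    errors = [angular_error_deg(direction(*p), direction(*t))
              for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)

# Toy retrieval scores (hypothetical): query 2 ranks the wrong caption first.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.1],
       [0.7, 0.3, 0.5]]
r1 = recall_at_1(sim)

# Toy localization predictions (hypothetical), as (azimuth, elevation) in degrees.
preds = [(0.0, 0.0), (90.0, 0.0)]
trues = [(10.0, 0.0), (90.0, 30.0)]
mae = mean_angular_error_deg(preds, trues)
print(r1, mae)
```

A higher R@1 is better for retrieval, while a lower angular error is better for localization, which is why the two reported improvements carry opposite signs.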

Abstract

Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute), and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to absolute…

Videos

Learning Spatially-Aware Language and Audio Embeddings · SlidesLive

Taxonomy

Topics: Speech and dialogue systems

Methods: Evolved Sign Momentum · ALIGN