Towards Spatial Audio Understanding via Question Answering

Parthasaarathy Sudarsanam; Archontis Politis

arXiv:2507.09195 · cs.SD · July 15, 2025

TL;DR

This paper presents a question answering framework for spatial audio understanding of first-order ambisonic (FOA) signals. It combines a curated dataset, LLM-based rephrasing for linguistic diversity, and a baseline model trained with scene-level supervision to interpret sound scenes.

Contribution

It introduces a novel QA-based approach for spatial audio understanding, including dataset curation, linguistic enhancement, and a baseline model trained with scene-level supervision.

Findings

01. Model achieves performance comparable to fully supervised methods

02. Enhanced linguistic diversity improves question answering robustness

03. Dataset and baseline model facilitate future research in spatial audio QA

Abstract

In this paper, we introduce a novel framework for spatial audio understanding of first-order ambisonic (FOA) signals through a question answering (QA) paradigm, aiming to extend the scope of sound event localization and detection (SELD) towards spatial scene understanding and reasoning. First, we curate and release fine-grained spatio-temporal textual descriptions for the STARSS23 dataset using a rule-based approach, and further enhance linguistic diversity using large language model (LLM)-based rephrasing. We also introduce a QA dataset aligned with the STARSS23 scenes, covering aspects such as event presence, localization, and spatial and temporal relationships. To increase language variety, we again leverage LLMs to generate multiple rephrasings per question. Finally, we develop a baseline spatial audio QA model that takes FOA signals and natural language questions as input and…
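To make the rule-based step in the abstract more concrete, here is a minimal Python sketch of how scene descriptions and question-answer pairs might be derived from event annotations. It is an illustration only, not the authors' pipeline: the `Event` fields, the four direction sectors, the question templates, and the azimuth convention (degrees, positive to the left) are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Event:
    """Hypothetical annotation record; field names are assumptions."""
    label: str
    onset: float        # seconds
    offset: float       # seconds
    azimuth_deg: float  # assumed convention: positive = to the left


def direction_phrase(azimuth_deg: float) -> str:
    """Map an azimuth to a coarse sector (assumed 90-degree sectors)."""
    a = (azimuth_deg + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    if -45.0 <= a < 45.0:
        return "front"
    if 45.0 <= a < 135.0:
        return "left"
    if -135.0 <= a < -45.0:
        return "right"
    return "back"


def describe(e: Event) -> str:
    """Fill a fixed sentence template for one event."""
    return (f"The {e.label} sound comes from the "
            f"{direction_phrase(e.azimuth_deg)} between "
            f"{e.onset:.1f} s and {e.offset:.1f} s.")


def make_qa(events, vocabulary):
    """Build presence, localization, and temporal-order QA pairs."""
    qa = []
    present = {e.label for e in events}
    # Presence: one yes/no question per class in the vocabulary.
    for label in vocabulary:
        answer = "yes" if label in present else "no"
        qa.append((f"Is there a {label} sound in the scene?", answer))
    # Localization: one question per annotated event.
    for e in events:
        qa.append((f"Where is the {e.label} sound coming from?",
                   direction_phrase(e.azimuth_deg)))
    # Temporal relationship: compare onsets of each pair of events.
    for a, b in combinations(events, 2):
        answer = "yes" if a.onset < b.onset else "no"
        qa.append((f"Does the {a.label} sound start before "
                   f"the {b.label} sound?", answer))
    return qa
```

In a setup like the one the abstract describes, template outputs of this kind would then be passed to an LLM for rephrasing; that step is omitted here.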

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Subtitles and Audiovisual Media