Towards Spatial Audio Understanding via Question Answering
Parthasaarathy Sudarsanam, Archontis Politis

TL;DR
This paper presents a question answering framework for spatial audio understanding of first-order ambisonic (FOA) signals, combining dataset creation, LLM-based linguistic diversification, and a baseline model that interprets sound scenes using only scene-level supervision.
Contribution
It introduces a novel QA-based approach for spatial audio understanding, including dataset curation, linguistic enhancement, and a baseline model trained with scene-level supervision.
Findings
The baseline model, trained with scene-level supervision, achieves performance comparable to fully supervised methods
Enhanced linguistic diversity improves question answering robustness
Dataset and baseline model facilitate future research in spatial audio QA
Abstract
In this paper, we introduce a novel framework for spatial audio understanding of first-order ambisonic (FOA) signals through a question answering (QA) paradigm, aiming to extend the scope of sound event localization and detection (SELD) towards spatial scene understanding and reasoning. First, we curate and release fine-grained spatio-temporal textual descriptions for the STARSS23 dataset using a rule-based approach, and further enhance linguistic diversity using large language model (LLM)-based rephrasing. We also introduce a QA dataset aligned with the STARSS23 scenes, covering various aspects such as event presence, localization, spatial, and temporal relationships. To increase language variety, we again leverage LLMs to generate multiple rephrasings per question. Finally, we develop a baseline spatial audio QA model that takes FOA signals and natural language questions as input and…
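The rule-based curation step described in the abstract can be pictured with a minimal sketch. This is not the authors' pipeline: the `Event` fields, the azimuth thresholds in `direction`, the sign convention (positive azimuth to the left), and the question templates are all assumptions made purely for illustration. It only shows how event-presence, localization, and temporal-relationship questions could in principle be derived from SELD-style annotations.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical scene annotation; field names are illustrative only."""
    label: str      # sound class, e.g. "speech"
    onset: float    # start time in seconds
    offset: float   # end time in seconds
    azimuth: float  # degrees; positive = left (assumed convention)


def direction(azimuth: float) -> str:
    """Map an azimuth to a coarse direction word (assumed thresholds)."""
    if -45 <= azimuth <= 45:
        return "front"
    if 45 < azimuth < 135:
        return "left"
    if -135 < azimuth < -45:
        return "right"
    return "back"


def make_qa(events, classes):
    """Generate (question, answer) pairs from templates (all hypothetical)."""
    qa = []
    present = {e.label for e in events}
    # Event presence: one yes/no question per candidate class.
    for c in classes:
        answer = "yes" if c in present else "no"
        qa.append((f"Is there a {c} sound in the scene?", answer))
    # Localization: coarse direction of each annotated event.
    for e in events:
        qa.append((f"Where is the {e.label} sound coming from?",
                   direction(e.azimuth)))
    # Temporal relationship: a ends before b starts.
    for a in events:
        for b in events:
            if a is not b and a.offset <= b.onset:
                qa.append((f"Does the {a.label} occur before the {b.label}?",
                           "yes"))
    return qa


events = [Event("speech", 0.0, 2.0, 10.0), Event("door", 3.0, 4.0, -90.0)]
qa = make_qa(events, ["speech", "door", "music"])
for question, answer in qa:
    print(question, "->", answer)
```

In a setup like the one the abstract describes, each templated question would then be rephrased several times by an LLM to increase linguistic variety; that step is omitted here.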
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Subtitles and Audiovisual Media
