NAAQA: A Neural Architecture for Acoustic Question Answering
Jerome Abdelnour, Jean Rouat, Giampiero Salvi

TL;DR
This paper introduces NAAQA, a neural architecture tailored for acoustic question answering, demonstrating improved accuracy with fewer parameters on a new benchmark and analyzing the impact of temporal and spectral features.
Contribution
The paper proposes NAAQA, a novel neural network architecture for acoustic question answering, and introduces the CLEAR2 benchmark emphasizing acoustic input challenges.
Findings
NAAQA achieves 79.5% accuracy with fewer parameters.
Time coordinate maps improve performance by ~17%.
Frequency coordinate maps have minimal impact.
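To illustrate what a coordinate map is, the sketch below appends a channel that encodes each position along the time (or frequency) axis to a spectrogram-like array. It is a minimal illustration and not the authors' implementation: the function name `add_coordinate_maps`, the `(channels, freq_bins, time_frames)` layout, and the [-1, 1] value range are all assumptions made here.

```python
import numpy as np

def add_coordinate_maps(spec, time=True, freq=False):
    """Append coordinate channels to a (channels, freq_bins, time_frames) array.

    Hypothetical helper: the layout and the [-1, 1] range are assumptions,
    not details taken from the paper.
    """
    _, n_freq, n_time = spec.shape
    maps = [spec]
    if time:
        # One value per time frame, repeated across all frequency bins.
        t = np.linspace(-1.0, 1.0, n_time, dtype=spec.dtype)
        maps.append(np.broadcast_to(t[None, None, :], (1, n_freq, n_time)))
    if freq:
        # One value per frequency bin, repeated across all time frames.
        f = np.linspace(-1.0, 1.0, n_freq, dtype=spec.dtype)
        maps.append(np.broadcast_to(f[None, :, None], (1, n_freq, n_time)))
    return np.concatenate(maps, axis=0)

# Toy input: 1 channel, 64 frequency bins, 100 time frames.
spec = np.random.rand(1, 64, 100).astype(np.float32)
with_time = add_coordinate_maps(spec, time=True, freq=False)
with_both = add_coordinate_maps(spec, time=True, freq=True)
```

A convolutional layer that receives the extra time channel can condition its output on where in the scene a sound occurs, which is one plausible reason such maps would help with questions about temporal order.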
Abstract
The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling variable-duration scenes, and scenes built with elementary sounds that differ between the training and test sets. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. We show that time coordinate maps augment temporal localization capabilities which enhance performance of…
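As a rough illustration of why 1D convolutions in time and frequency could reduce model complexity, the sketch below filters a toy spectrogram along each axis separately and compares the parameter count of one 3x3 convolution with a pair of 1x3 and 3x1 convolutions. The helper names, kernel values, and channel widths are assumptions chosen for the example; they do not describe the NAAQA layers.

```python
import numpy as np

def conv1d_along(spec, kernel, axis):
    """Apply a 1D filter along one axis of a 2D (freq_bins, time_frames) array.

    Note: np.convolve flips the kernel (true convolution), unlike the
    cross-correlation used by most deep learning libraries.
    """
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, spec
    )

def conv_params(c_in, c_out, kernel_shape):
    """Weights plus biases of a standard convolution layer."""
    kh, kw = kernel_shape
    return c_in * c_out * kh * kw + c_out

# Toy spectrogram: 64 frequency bins, 100 time frames.
spec = np.random.rand(64, 100)
time_kernel = np.array([0.25, 0.5, 0.25])   # smooths along time
freq_kernel = np.array([-1.0, 0.0, 1.0])    # differences along frequency

# Filter in time first, then in frequency.
out = conv1d_along(conv1d_along(spec, time_kernel, axis=1), freq_kernel, axis=0)

# Hypothetical layer widths: 32 input and 32 output channels.
params_2d = conv_params(32, 32, (3, 3))
params_1d_pair = conv_params(32, 32, (1, 3)) + conv_params(32, 32, (3, 1))
print(params_2d, params_1d_pair)
```

With these assumed widths the pair of 1D layers uses 6208 parameters against 9248 for the single 2D layer, and the gap widens as the kernel size grows.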
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Advanced Image and Video Retrieval Techniques
