NAAQA: A Neural Architecture for Acoustic Question Answering

Jerome Abdelnour; Jean Rouat; Giampiero Salvi

arXiv:2106.06147 · cs.CL · January 15, 2024

TL;DR

This paper introduces NAAQA, a neural architecture tailored for acoustic question answering, demonstrating improved accuracy with fewer parameters on a new benchmark and analyzing the impact of temporal and spectral features.

Contribution

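The paper proposes NAAQA, a novel neural network architecture for acoustic question answering, and introduces the CLEAR2 benchmark, which emphasizes challenges specific to acoustic inputs: scenes of variable duration, and scenes built from elementary sounds that differ between the training and test sets.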

Findings

01. NAAQA achieves 79.5% accuracy with fewer parameters.

02. Time coordinate maps improve performance by ~17%.

03. Frequency coordinate maps have minimal impact.
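
To make the coordinate-map findings concrete, here is a minimal NumPy sketch of how a time or frequency coordinate map could be appended to a spectrogram as an extra input channel. The function name `add_coordinate_maps`, the (channels, frequency, time) layout, and the [-1, 1] value range are assumptions chosen for illustration; the summary on this page does not specify how NAAQA builds its maps.

```python
import numpy as np

def add_coordinate_maps(spec, time=True, freq=False):
    """Append coordinate channels to a (channels, freq_bins, time_frames) array.

    Each map is a linear ramp along one axis and constant along the other.
    The [-1, 1] range is an illustrative assumption, not taken from the paper.
    """
    _, n_freq, n_time = spec.shape
    extra = []
    if time:
        # Ramp over time frames, repeated for every frequency bin.
        t = np.linspace(-1.0, 1.0, n_time, dtype=spec.dtype)
        extra.append(np.broadcast_to(t, (1, n_freq, n_time)))
    if freq:
        # Ramp over frequency bins, repeated for every time frame.
        f = np.linspace(-1.0, 1.0, n_freq, dtype=spec.dtype)
        extra.append(np.broadcast_to(f[:, None], (1, n_freq, n_time)))
    return np.concatenate([spec, *extra], axis=0)

# Hypothetical single-channel spectrogram: 4 frequency bins, 5 time frames.
spec = np.zeros((1, 4, 5), dtype=np.float32)
with_time = add_coordinate_maps(spec)
with_both = add_coordinate_maps(spec, time=True, freq=True)
```

With the extra channel, a convolution that would otherwise respond identically at every position can condition its output on where in the scene a sound occurs, which is one plausible reading of why the time map helps temporal localization.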

Abstract

The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling of variable duration scenes, and scenes built with elementary sounds that differ between training and test set. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. We show that time coordinate maps augment temporal localization capabilities which enhance performance of…
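
As a rough illustration of why 1D convolutions in time and frequency can reduce model complexity, the sketch below filters a single-channel spectrogram with one 1D kernel along time followed by one 1D kernel along frequency, and compares the weight count with a 2D kernel covering the same area. The sequential ordering, the helper names `conv1d_along` and `time_then_freq`, and the averaging kernels are assumptions made for this example; the abstract does not say how NAAQA arranges its 1D convolutions.

```python
import numpy as np

def conv1d_along(x, kernel, axis):
    """Valid 1D cross-correlation of every slice of x along the given axis."""
    return np.apply_along_axis(
        lambda v: np.correlate(v, kernel, mode="valid"), axis, x
    )

def time_then_freq(spec, k_time, k_freq):
    """Filter a (freq_bins, time_frames) array along time, then along frequency."""
    out = conv1d_along(spec, k_time, axis=1)
    return conv1d_along(out, k_freq, axis=0)

# Hypothetical spectrogram: 64 frequency bins, 100 time frames.
spec = np.random.default_rng(0).standard_normal((64, 100))
k_time = np.ones(3) / 3.0   # illustrative averaging kernel over 3 frames
k_freq = np.ones(3) / 3.0   # illustrative averaging kernel over 3 bins

out = time_then_freq(spec, k_time, k_freq)

# Weights per single-channel filter: two 1D kernels versus one 3x3 2D kernel.
params_1d = k_time.size + k_freq.size
params_2d = k_time.size * k_freq.size
```

For kernel length k, the pair of 1D kernels uses 2k weights where a k×k kernel uses k², so the gap widens as kernels grow. In a real convolutional layer both counts are also multiplied by the input and output channel counts.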

Code & Models

Repositories

necotis/naaqa-acoustic-question-answering
pytorch

Datasets

J3romee/CLEAR
dataset · 66 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Advanced Image and Video Retrieval Techniques