Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering

Samuel Lipping; Parthasaarathy Sudarsanam; Konstantinos Drossos; Tuomas Virtanen

arXiv:2204.09634·cs.SD·June 20, 2022

TL;DR

Clotho-AQA is a crowdsourced dataset for audio question answering. It contains 1991 audio clips, each paired with six questions whose answers were collected from three annotators, and is intended for developing and evaluating AQA models.

Contribution

The paper introduces Clotho-AQA, a novel dataset for audio question answering with crowdsourced questions and answers, and provides baseline models for the task.

Findings

01

Baseline models achieved 62.7% accuracy for yes/no questions.

02

Baseline models achieved 54.2% top-1 accuracy for multi-class (single-word) answers.

03

Dataset is publicly available for research use.
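
For reference when reading the accuracy figures above: top-k accuracy is commonly computed as the fraction of questions whose reference answer appears among the model's k highest-ranked candidates, with k = 1 giving top-1 accuracy. The sketch below assumes that definition. This page does not state how the baselines were scored, and the function name and input layout are hypothetical.

```python
from typing import List, Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of items whose reference is among the first k predictions.

    ranked_predictions[i] is a list of candidate answers for question i,
    ordered from most to least likely (hypothetical layout).
    """
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not references:
        return 0.0
    hits = sum(
        1
        for candidates, reference in zip(ranked_predictions, references)
        if reference in candidates[:k]
    )
    return hits / len(references)


# Toy example with made-up predictions, not data from the paper.
toy_predictions: List[List[str]] = [["yes", "no"], ["no", "yes"]]
toy_references: List[str] = ["yes", "yes"]
top1 = top_k_accuracy(toy_predictions, toy_references, k=1)
top2 = top_k_accuracy(toy_predictions, toy_references, k=2)
```

For a yes/no question there are only two candidate answers, so plain accuracy and top-1 accuracy coincide under this definition.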

Abstract

Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Out of the six questions for each audio file, two questions each are designed to have 'yes' and 'no' as answers, while the remaining two questions have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to describe the usage of our dataset for the AQA…
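
The collection scheme in the abstract implies the overall size of the dataset. The following is a minimal sketch of that arithmetic, assuming every clip has exactly six questions and every question exactly three answers, as the abstract describes. The record type and its field names are hypothetical and do not reflect the released file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Figures taken from the abstract.
NUM_CLIPS = 1991
QUESTIONS_PER_CLIP = 6      # 2 'yes', 2 'no', 2 other single-word answers
ANSWERS_PER_QUESTION = 3    # one answer from each of three annotators


@dataclass
class QuestionRecord:
    """One question about one clip (hypothetical layout, for illustration)."""
    audio_file: str
    question: str
    answers: List[str]  # one entry per annotator


def expected_totals(num_clips: int = NUM_CLIPS) -> Tuple[int, int]:
    """Return (total questions, total collected answers) implied by the scheme."""
    total_questions = num_clips * QUESTIONS_PER_CLIP
    total_answers = total_questions * ANSWERS_PER_QUESTION
    return total_questions, total_answers


total_questions, total_answers = expected_totals()
print(total_questions, total_answers)  # prints: 11946 35838
```

Under these assumptions the scheme yields 11946 questions and 35838 collected answers, with one third of the questions targeting each of the three answer types.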

Code & Models

Datasets

MERA-evaluation/ruEnvAQA
dataset · 55 downloads

Taxonomy

Topics: Music and Audio Processing · Speech and dialogue systems · Speech and Audio Processing