Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, Tuomas Virtanen

TL;DR
Clotho-AQA is a new crowdsourced dataset for audio question answering, featuring 1991 audio clips with diverse questions and answers, enabling development and evaluation of multimodal AQA models.
Contribution
The paper introduces Clotho-AQA, a novel dataset for audio question answering with crowdsourced questions and answers, and provides baseline models for the task.
Findings
Baseline models achieved 62.7% accuracy for yes/no questions.
The multi-class baseline achieved a top-1 accuracy of 54.2%.
The dataset is publicly available for research use.
Abstract
Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Out of the six questions for each audio file, two questions each are designed to have 'yes' and 'no' as answers, while the remaining two questions have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to describe the usage of our dataset for the AQA…
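The collection scheme in the abstract implies the overall size of the dataset. The sketch below works through that arithmetic and shows one possible way to represent a single annotated question. The record layout, field names, and function name are hypothetical and chosen for illustration only; they are not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import Tuple

# Figures taken from the abstract.
NUM_AUDIO_FILES = 1991
ANSWERS_PER_QUESTION = 3

# Per-file question design from the abstract: two questions intended to be
# answered 'yes', two intended to be answered 'no', two with other
# single-word answers.
QUESTION_DESIGN = {"yes": 2, "no": 2, "single_word": 2}


@dataclass
class QuestionRecord:
    """Hypothetical in-memory record for one question (not the real schema)."""
    audio_file: str
    question: str
    intended_type: str             # one of the QUESTION_DESIGN keys
    answers: Tuple[str, str, str]  # one answer from each of three annotators


def dataset_counts(num_files: int = NUM_AUDIO_FILES) -> Tuple[int, int]:
    """Return (total questions, total collected answers) implied by the design."""
    questions_per_file = sum(QUESTION_DESIGN.values())  # 2 + 2 + 2 = 6
    total_questions = num_files * questions_per_file
    total_answers = total_questions * ANSWERS_PER_QUESTION
    return total_questions, total_answers


if __name__ == "__main__":
    questions, answers = dataset_counts()
    print(questions)  # 11946
    print(answers)    # 35838
```

Under these assumptions the design yields 11,946 questions and 35,838 collected answers across the 1991 audio files.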
Taxonomy
Topics: Music and Audio Processing · Speech and dialogue systems · Speech and Audio Processing
