Automated Audio Captioning and Language-Based Audio Retrieval
Clive Gomes, Hyejin Park, Patrick Kollman, Yi Song, Iffanice Houndayi, Ankit Shah

TL;DR
This paper describes participation in the DCASE 2022 Competition focusing on automated audio captioning and language-based audio retrieval, with experiments modifying baseline models and achieving competitive results.
Contribution
It introduces modified models for audio captioning and retrieval, with the retrieval model surpassing baseline performance in the competition.
Findings
Retrieval model outperformed baseline in DCASE 2022.
Captioning model achieved performance close to baseline.
Experiments demonstrated effectiveness of model modifications.
Abstract
This project involved participation in the DCASE 2022 Competition (Task 6), which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based Audio Retrieval. The first subtask involved the generation of a textual description for audio samples, while the goal of the second was to find audio samples within a fixed dataset that match a given description. For both subtasks, the Clotho dataset was used. The models were evaluated on BLEU-1, BLEU-2, BLEU-3, ROUGE-L, METEOR, CIDEr, SPICE, and SPIDEr scores for audio captioning and R1, R5, R10, and mAP10 scores for audio retrieval. We have conducted a handful of experiments that modify the baseline models for these tasks. Our final architecture for Automated Audio Captioning is close to the baseline performance, while our model for Language-Based Audio Retrieval has surpassed its counterpart.
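To make the retrieval metrics named in the abstract concrete, the following is a minimal sketch of how recall at k (R1, R5, R10) and top-10 mean average precision could be computed from a caption-to-audio similarity matrix. It assumes that each caption query has exactly one relevant audio clip, so average precision reduces to the reciprocal rank of that clip (or zero if it falls outside the top 10). The function names, the toy score matrix, and the target indices are hypothetical and are not taken from the paper or the DCASE evaluation code.

```python
def rank_of_target(scores, target_index):
    """Return the 1-based rank of the target clip for one query.

    `scores` holds one similarity score per audio clip; higher means a
    better match. Ties are resolved in favour of the target (assumption).
    """
    target_score = scores[target_index]
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks, k):
    """Fraction of queries whose relevant clip appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def map_at_k(ranks, k=10):
    """Mean average precision at k, assuming one relevant clip per query.

    With a single relevant item, average precision is 1/rank when the
    item is inside the top k and 0 otherwise.
    """
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)


# Toy example: 3 caption queries scored against 4 audio clips.
score_matrix = [
    [0.9, 0.1, 0.3, 0.2],  # query 0, relevant clip is index 0
    [0.5, 0.2, 0.7, 0.1],  # query 1, relevant clip is index 1
    [0.1, 0.4, 0.3, 0.8],  # query 2, relevant clip is index 2
]
targets = [0, 1, 2]

ranks = [rank_of_target(row, t) for row, t in zip(score_matrix, targets)]

r1 = recall_at_k(ranks, 1)
r5 = recall_at_k(ranks, 5)
map10 = map_at_k(ranks, 10)
```

In this toy case the relevant clips are ranked 1st, 3rd, and 3rd, so only one of the three queries counts toward R1 while all three count toward R5. The reciprocal-rank shortcut only holds under the one-relevant-clip assumption; with several relevant clips per query, a full average-precision computation would be needed.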
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Speech and Audio Processing
