Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim, Jaeyoon Jung, Minjeong Jeon, Sang Hoon Woo, Jinjoo Lee

TL;DR
This paper presents an improved audio captioning system based on EnCLAP with auxiliary retrieval, achieving state-of-the-art results in automated audio captioning and retrieval tasks.
Contribution
We extend the EnCLAP framework with modifications and reranking, and introduce a supplementary retriever model for enhanced audio captioning and retrieval performance.
Findings
FENSE score of 0.542 on Task6
mAP@10 score of 0.386 on Task8
Significantly outperforms the baseline models on both tasks
Abstract
In this technical report, we describe our submission to DCASE2024 Challenge Task6 (Automated Audio Captioning) and Task8 (Language-based Audio Retrieval). We build our approach upon the EnCLAP audio captioning framework and optimize it for Task6 of the challenge. Notably, we outline the changes in the underlying components and the incorporation of the reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task8. Our proposed systems achieve a FENSE score of 0.542 on Task6 and an mAP@10 score of 0.386 on Task8, significantly outperforming the baseline models.
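For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how mean average precision at 10 (mAP@10) is commonly computed. It is not taken from the report or from the challenge's evaluation code: the function names, the toy data, and the choice to normalize by min(number of relevant items, k) are assumptions made for illustration.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(
    ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10
) -> float:
    """AP@k for one query: sum of precision values at each rank where a
    relevant item appears, divided by a normalizer."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumption: normalize by min(|relevant|, k), one common convention.
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int = 10,
) -> float:
    """Mean of AP@k over all queries (here, one query per caption)."""
    scores = [
        average_precision_at_k(ranked, rel, k)
        for ranked, rel in zip(rankings, relevants)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two text queries, each with one relevant audio clip.
rankings = [["a", "b", "c"], ["x", "y"]]
relevants = [{"b"}, {"z"}]
print(mean_average_precision_at_k(rankings, relevants))  # 0.25
```

When every query has exactly one relevant clip, AP@10 under this convention reduces to the reciprocal rank of that clip if it appears in the top 10, and 0 otherwise.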
Taxonomy
Topics: Natural Language Processing Techniques · Music and Audio Processing · Subtitles and Audiovisual Media
