Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning

Jaeyeon Kim; Jaeyoon Jung; Minjeong Jeon; Sang Hoon Woo; Jinjoo Lee

arXiv:2409.01160 · eess.AS · September 4, 2024

TL;DR

This paper presents an improved audio captioning system based on EnCLAP with auxiliary retrieval, achieving state-of-the-art results in automated audio captioning and retrieval tasks.

Contribution

We extend the EnCLAP framework with modifications and reranking, and introduce a supplementary retriever model for enhanced audio captioning and retrieval performance.

Findings

01

FENSE score of 0.542 on Task6

02

mAP@10 score of 0.386 on Task8

03

Significantly outperforms the baseline models on both tasks

Abstract

In this technical report, we describe our submission to DCASE2024 Challenge Task6 (Automated Audio Captioning) and Task8 (Language-based Audio Retrieval). We build our approach on the EnCLAP audio captioning framework and optimize it for Task6 of the challenge. Notably, we outline the changes to the underlying components and the incorporation of a reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task8. Our proposed systems achieve a FENSE score of 0.542 on Task6 and an mAP@10 score of 0.386 on Task8, significantly outperforming the baseline models.
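
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how mAP@10 is commonly computed. It is not the challenge's official scoring code: the function names are hypothetical, and the normalization by min(k, number of relevant items) is one common convention that may differ from the one used by the Task8 evaluation.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(
    ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10
) -> float:
    """AP@k for one query: mean of precision values at each relevant hit in the top k.

    Assumption: normalized by min(k, number of relevant items).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int = 10,
) -> float:
    """mAP@k: AP@k averaged over all queries (here, text captions)."""
    scores = [
        average_precision_at_k(ranked, rel, k)
        for ranked, rel in zip(rankings, relevants)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

With a single relevant audio clip per caption query, AP@10 reduces to the reciprocal rank of that clip when it appears in the top 10, and 0 otherwise.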

Taxonomy

Topics: Natural Language Processing Techniques · Music and Audio Processing · Subtitles and Audiovisual Media