EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance

Jaeyeon Kim; Minjeon Jeon; Jaeyoon Jung; Sang Hoon Woo; Jinjoo Lee

arXiv:2409.01201 · eess.AS · September 4, 2024

TL;DR

This paper analyzes and enhances the EnCLAP framework for automated audio captioning by modifying its acoustic encoder components, pretraining on datasets of different scales, and adding a reranking scheme. The resulting model, EnCLAP++, significantly outperforms the original.

Contribution

The paper introduces EnCLAP++, an improved audio captioning model obtained through systematic analysis and optimization of the original EnCLAP framework.

Findings

01. EnCLAP++ outperforms the original EnCLAP in caption quality.

02. Modifications to the acoustic encoder improve captioning performance.

03. Pretraining with larger datasets enhances model effectiveness.

Abstract

In this work, we aim to analyze and optimize the EnCLAP framework, a state-of-the-art model in automated audio captioning. We investigate the impact of modifying the acoustic encoder components, explore pretraining with different dataset scales, and study the effectiveness of a reranking scheme. Through extensive experimentation and quantitative analysis of generated captions, we develop EnCLAP++, an enhanced version that significantly surpasses the original.

Code & Models

Repositories

jaeyeonkim99/enclap
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Speech Recognition and Synthesis