EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
Jaeyeon Kim, Minjeon Jeon, Jaeyoon Jung, Sang Hoon Woo, Jinjoo Lee

TL;DR
This paper analyzes and improves the EnCLAP framework for automated audio captioning by modifying its acoustic encoder components, pretraining on datasets of different scales, and applying a reranking scheme. The resulting model, EnCLAP++, significantly outperforms the original.
Contribution
It introduces EnCLAP++, an improved audio captioning model, through systematic analysis and optimization of the original EnCLAP framework.
Findings
EnCLAP++ outperforms the original EnCLAP in caption quality.
Modifications to the acoustic encoder improve captioning performance.
Pretraining with larger datasets enhances model effectiveness.
Abstract
In this work, we aim to analyze and optimize the EnCLAP framework, a state-of-the-art model in automated audio captioning. We investigate the impact of modifying the acoustic encoder components, explore pretraining with different dataset scales, and study the effectiveness of a reranking scheme. Through extensive experimentation and quantitative analysis of generated captions, we develop EnCLAP++, an enhanced version that significantly surpasses the original.
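The abstract does not say how the reranking scheme scores candidate captions. As a generic illustration only, and not the authors' method, the sketch below assumes each candidate caption and the input audio have already been mapped into a shared embedding space, and it orders candidates by cosine similarity to the audio embedding. The function names `cosine` and `rerank` and the toy 3-dimensional embeddings are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(
    audio_emb: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> List[str]:
    """Return candidate captions sorted by similarity to the audio embedding, best first.

    ASSUMPTION: similarity in a shared audio-text embedding space is the
    ranking score; the paper's actual scoring rule is not given in the abstract.
    """
    scored = [(cosine(audio_emb, emb), caption) for caption, emb in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored]


# Toy embeddings, made up for illustration only.
audio = [1.0, 0.0, 0.0]
cands = [
    ("a dog barks", [0.0, 1.0, 0.0]),
    ("rain falls on a roof", [0.9, 0.1, 0.0]),
    ("a car passes by", [0.5, 0.5, 0.0]),
]
ranked = rerank(audio, cands)
print(ranked[0])
```

In a real pipeline the candidates would typically come from sampling or beam search in the captioning model, and the embeddings from a pretrained audio-text encoder; both are outside the scope of this sketch.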
Taxonomy
Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Speech Recognition and Synthesis
