SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs
Wenxi Chen, Ziyang Ma, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen

TL;DR
SLAM-AAC enhances automated audio captioning by integrating paraphrasing augmentation and CLAP-Refine with large language models, leading to more diverse and accurate descriptions and achieving state-of-the-art results.
Contribution
The paper introduces a novel AAC framework combining paraphrasing augmentation and CLAP-Refine, leveraging LLMs and self-supervised audio representations for improved captioning performance.
Findings
Achieves state-of-the-art results on Clotho V2 and AudioCaps datasets.
Utilizes paraphrasing augmentation to diversify captions from limited audio-text pairs.
Employs CLAP-Refine to select the best caption from multiple decoding outputs.
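The caption-selection idea in the last finding can be illustrated with a minimal sketch. It assumes that CLAP-Refine scores each candidate caption by the cosine similarity between an audio embedding and a text embedding from a CLAP-style model and keeps the highest-scoring one. The function names and the toy three-dimensional embeddings are hypothetical stand-ins, not the authors' implementation; real embeddings would come from a pre-trained audio-text encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def select_best_caption(audio_embedding, candidates):
    """Pick the candidate whose text embedding is closest to the audio.

    candidates: list of (caption, text_embedding) pairs, e.g. one pair per
    decoding output. Returns (best_caption, best_score).
    """
    if not candidates:
        raise ValueError("need at least one candidate caption")
    scored = [
        (cosine_similarity(audio_embedding, embedding), caption)
        for caption, embedding in candidates
    ]
    best_score, best_caption = max(scored, key=lambda pair: pair[0])
    return best_caption, best_score


# Toy embeddings (assumed values, for illustration only).
audio = [1.0, 0.0, 0.0]
candidates = [
    ("a dog barks", [0.9, 0.1, 0.0]),
    ("rain falls on a roof", [0.0, 1.0, 0.0]),
    ("a dog barks while cars pass", [0.7, 0.7, 0.0]),
]
best_caption, best_score = select_best_caption(audio, candidates)
print(best_caption)
```

Because the selection only re-ranks outputs that the captioning model has already produced, a step of this kind adds no trainable parameters; its quality depends on how well the audio-text encoder's similarity tracks caption quality.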
Abstract
Automated Audio Captioning (AAC) aims to generate natural textual descriptions for input audio signals. Recent progress in audio pre-trained models and large language models (LLMs) has significantly enhanced audio understanding and textual reasoning capabilities, making improvements in AAC possible. In this paper, we propose SLAM-AAC to further enhance AAC with paraphrasing augmentation and CLAP-Refine through LLMs. Our approach uses the self-supervised EAT model to extract fine-grained audio representations, which are then aligned with textual embeddings via lightweight linear layers. The caption generation LLM is efficiently fine-tuned using the LoRA adapter. Drawing inspiration from the back-translation method in machine translation, we implement paraphrasing augmentation to expand the Clotho dataset during pre-training. This strategy helps alleviate the limitation of scarce…
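The alignment step mentioned in the abstract, where audio representations are mapped to the text embedding space through linear layers, can be sketched as follows. This is a simplified illustration under stated assumptions: a single linear layer, tiny hand-picked dimensions, and audio embeddings placed before the text embeddings in the LLM input. The names `linear_project` and `build_llm_input` are hypothetical, and the layer count, dimensions, and ordering used in the paper are not given in this summary.

```python
def linear_project(frames, weight, bias):
    """Apply y = W x + b to every audio frame.

    frames: list of vectors of size d_in (one per audio frame)
    weight: d_out rows, each of size d_in
    bias:   vector of size d_out
    """
    d_in = len(weight[0])
    projected = []
    for x in frames:
        if len(x) != d_in:
            raise ValueError("frame size does not match weight columns")
        y = [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)
        ]
        projected.append(y)
    return projected


def build_llm_input(audio_frames, text_embeddings, weight, bias):
    """Concatenate projected audio frames with text embeddings.

    Placing audio first is an assumption made for this sketch.
    """
    return linear_project(audio_frames, weight, bias) + text_embeddings


# Toy example: audio features of size 3 projected to an embedding size of 2.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]
audio_frames = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
text_embeddings = [[0.2, 0.4]]  # one prompt token embedding (assumed values)

llm_input = build_llm_input(audio_frames, text_embeddings, weight, bias)
print(llm_input)
```

In a setup like this, only the projection weights (and, per the abstract, the LoRA adapter on the LLM) would need training, which is what keeps the alignment lightweight.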
Taxonomy
Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
