SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Wenxi Chen; Ziyang Ma; Xiquan Li; Xuenan Xu; Yuzhe Liang; Zhisheng Zheng; Kai Yu; Xie Chen

arXiv:2410.09503·eess.AS·October 15, 2024

TL;DR

SLAM-AAC enhances automated audio captioning by integrating paraphrasing augmentation and CLAP-Refine with large language models, leading to more diverse and accurate descriptions and achieving state-of-the-art results.

Contribution

The paper introduces a novel AAC framework combining paraphrasing augmentation and CLAP-Refine, leveraging LLMs and self-supervised audio representations for improved captioning performance.

Findings

01

Achieves state-of-the-art results on Clotho V2 and AudioCaps datasets.

02

Utilizes paraphrasing augmentation to diversify captions from limited audio-text pairs.

03

Employs CLAP-Refine to select the best caption from among multiple decoding outputs.
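
The selection step in the third finding can be illustrated with a minimal sketch. The summary above says only that CLAP-Refine picks the best caption from multiple decoding outputs; the sketch assumes that each candidate is scored by cosine similarity between an audio embedding and the candidate's text embedding, as a CLAP-style contrastive model could provide. The function names `cosine` and `select_caption` are hypothetical, and the embeddings are taken as precomputed inputs rather than produced by any real model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_caption(
    audio_emb: Sequence[float],
    candidates: List[str],
    text_embs: List[Sequence[float]],
) -> Tuple[str, float]:
    """Return the candidate whose text embedding is closest to the audio.

    Assumption: audio_emb and text_embs live in one shared embedding
    space (hypothetical precomputed inputs, not a real model's output).
    """
    if not candidates or len(candidates) != len(text_embs):
        raise ValueError("need one text embedding per candidate caption")
    scores = [cosine(audio_emb, t) for t in text_embs]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy usage with made-up 2-D embeddings.
caption, score = select_caption(
    [1.0, 0.0],
    ["a dog barks", "rain falls"],
    [[0.9, 0.1], [0.0, 1.0]],
)
```

In this toy case the first candidate is chosen because its embedding points in nearly the same direction as the audio embedding. In practice the candidates would come from several decoding runs of the captioning model.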

Abstract

Automated Audio Captioning (AAC) aims to generate natural textual descriptions for input audio signals. Recent progress in audio pre-trained models and large language models (LLMs) has significantly enhanced audio understanding and textual reasoning capabilities, making improvements in AAC possible. In this paper, we propose SLAM-AAC to further enhance AAC with paraphrasing augmentation and CLAP-Refine through LLMs. Our approach uses the self-supervised EAT model to extract fine-grained audio representations, which are then aligned with textual embeddings via lightweight linear layers. The caption generation LLM is efficiently fine-tuned using the LoRA adapter. Drawing inspiration from the back-translation method in machine translation, we implement paraphrasing augmentation to expand the Clotho dataset during pre-training. This strategy helps alleviate the limitation of scarce…
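
The paraphrasing augmentation mentioned in the abstract can be sketched as a simple dataset expansion. This is an illustration under stated assumptions, not the authors' pipeline: the paraphraser is passed in as a hypothetical callable `paraphrase` (in practice it might wrap an LLM or a back-translation system), and the name `augment_with_paraphrases` and the `(audio_id, caption)` pair layout are invented for this example.

```python
from typing import Callable, Iterable, List, Tuple

AudioCaption = Tuple[str, str]  # (audio_id, caption); layout is an assumption


def augment_with_paraphrases(
    pairs: Iterable[AudioCaption],
    paraphrase: Callable[[str], List[str]],
) -> List[AudioCaption]:
    """Expand audio-caption pairs with paraphrases of each caption.

    Each original caption is kept, and every distinct paraphrase is added
    as an extra pair for the same audio clip. Duplicates are dropped using
    a case-insensitive comparison per clip.
    """
    seen = set()
    expanded: List[AudioCaption] = []
    for audio_id, caption in pairs:
        for text in [caption, *paraphrase(caption)]:
            cleaned = text.strip()
            key = (audio_id, cleaned.lower())
            if cleaned and key not in seen:
                seen.add(key)
                expanded.append((audio_id, cleaned))
    return expanded


def toy_paraphrase(caption: str) -> List[str]:
    """Stand-in paraphraser for the demo; a real one would call a model."""
    return [caption, caption.replace("barks", "is barking")]


augmented = augment_with_paraphrases([("a.wav", "A dog barks")], toy_paraphrase)
```

Here `augmented` holds the original pair plus one paraphrased pair for the same clip. The echoed copy of the original caption is filtered out by the duplicate check.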

Code & Models

Repositories

X-LANCE/SLAM-LLM (PyTorch, official)

Taxonomy

Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training