Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Jizhong Liu; Gang Li; Junbo Zhang; Heinrich Dinkel; Yongqing Wang; Zhiyong Yan; Yujun Wang; Bin Wang

arXiv:2406.13275 · cs.SD · June 26, 2024

TL;DR

This paper improves automated audio captioning by combining an optimized audio encoder, a large language model decoder, and LLM-based text error correction, reaching a SPIDEr-FL score of 33.0.

Contribution

It introduces an AAC framework that combines an audio encoder pre-trained via consistent ensemble distillation (CED), a Llama 2 decoder, and a text-correction LLM, with the audio encoder and the text decoder optimized with LoRA.
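As an illustration of the LoRA idea mentioned above, here is a minimal sketch in plain Python. It is not the authors' code: the matrix sizes, the rank `r = 1`, the scaling factor `alpha`, and all function names are assumptions chosen for readability. It shows the standard LoRA formulation, in which a frozen weight `W` is augmented by a trainable low-rank product, `W' = W + (alpha / r) * B @ A`, with `B` initialized to zero so that training starts from the pre-trained behavior.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape d_out x d_in
    a: trainable down-projection, shape r x d_in
    b: trainable up-projection, shape d_out x r
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def apply_linear(w, x):
    """Compute the matrix-vector product W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


# Hypothetical frozen layer: d_out = 3, d_in = 4.
W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
A = [[0.1, 0.2, 0.3, 0.4]]           # r x d_in, with r = 1
B_init = [[0.0], [0.0], [0.0]]       # d_out x r, zero at initialization
B_trained = [[1.0], [0.0], [-1.0]]   # made-up values standing in for a trained B

x = [1.0, 2.0, 3.0, 4.0]

out_frozen = apply_linear(W, x)
out_init = apply_linear(lora_effective_weight(W, A, B_init, alpha=2.0), x)
out_trained = apply_linear(lora_effective_weight(W, A, B_trained, alpha=2.0), x)

# Trainable parameters: r * (d_in + d_out) for LoRA versus d_out * d_in for full fine-tuning.
lora_params = len(A) * (len(A[0]) + len(B_init))
full_params = len(W) * len(W[0])

print(out_frozen, out_init, out_trained, lora_params, full_params)
```

With `B` at zero the adapted layer reproduces the frozen layer exactly, and only the small `A` and `B` matrices would receive gradients during fine-tuning. The parameter saving is modest in this toy case and grows with layer size, since the adapter scales with `d_in + d_out` rather than `d_in * d_out`.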

Findings

01

Achieved a SPIDEr-FL score of 33.0, surpassing previous best results.

02

Demonstrated effectiveness of ensemble distillation and Llama 2 in AAC.

03

Validated improvements through extensive experiments.

Abstract

Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) an audio encoder pre-trained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is…
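To make the token-compression role of the Q-Former concrete, the following is a heavily simplified sketch, not the paper's implementation. A real Q-Former stacks transformer layers with learned projections and self-attention among the queries; this sketch keeps only one projection-free cross-attention step. The number of queries, the feature dimension, and all names are assumptions. The point it illustrates is that a fixed set of query vectors attends over a variable-length sequence of acoustic tokens, so the output handed to the LLM always has as many vectors as there are queries.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head cross-attention without learned projections.

    Each query scores every acoustic token by scaled dot product and
    returns the attention-weighted average of the tokens.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(q_i * t_i for q_i, t_i in zip(q, tok)) / math.sqrt(dim)
            for tok in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


# Two hypothetical query vectors (learned in a real model), feature dimension 2.
QUERIES = [[1.0, 0.0], [0.0, 1.0]]

# Acoustic token sequences of different lengths, e.g. short and long clips.
short_clip = [[0.1 * i, 1.0 - 0.1 * i] for i in range(5)]
long_clip = [[0.01 * i, 1.0 - 0.01 * i] for i in range(50)]

compressed_short = cross_attend(QUERIES, short_clip)
compressed_long = cross_attend(QUERIES, long_clip)

print(len(short_clip), "->", len(compressed_short))
print(len(long_clip), "->", len(compressed_long))
```

Both clips are reduced to the same number of vectors, which keeps the LLM's input length independent of the audio duration.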

Code & Models

Repositories

frankenliu/LOAE
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: LLaMA