LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
Hyeongkeun Lee, Jongmin Choi, KiHyun Nam, Joon Son Chung

TL;DR
This paper introduces LAMB, an LLM-based audio captioning framework that aligns audio embeddings with the LLM text embedding space using Cauchy-Schwarz divergence, improving reasoning and achieving state-of-the-art results on AudioCaps.
Contribution
LAMB is the first to explicitly bridge the modality gap between audio embeddings and the LLM text embedding space through divergence-based alignment. It also introduces a Two-Stream Adapter that extracts richer audio features.
Findings
Achieves state-of-the-art performance on AudioCaps.
Effectively aligns audio and text embeddings at global and token levels.
Enhances reasoning capabilities of LLM-based audio captioning.
Abstract
Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally,…
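The abstract does not spell out the loss itself. As a minimal sketch, assuming the standard definition of Cauchy-Schwarz divergence, D_CS(p, q) = -log( (∫pq)² / (∫p² ∫q²) ), estimated from samples with a Gaussian kernel, the quantity being minimized between a set of audio embeddings and a set of text embeddings could look like the code below. The function names, the `sigma` bandwidth, and the toy vectors are illustrative assumptions, not taken from the paper, and the mutual-information term mentioned in the abstract is not covered.

```python
import math

def mean_gaussian_kernel(xs, ys, sigma):
    """Average Gaussian kernel value over all pairs (x, y) from xs and ys."""
    total = 0.0
    for x in xs:
        for y in ys:
            d2 = sum((a - b) ** 2 for a, b in zip(x, y))  # squared Euclidean distance
            total += math.exp(-d2 / (2.0 * sigma ** 2))
    return total / (len(xs) * len(ys))

def cs_divergence(xs, ys, sigma=1.0):
    """Empirical Cauchy-Schwarz divergence between two sets of vectors.

    Uses log(mean Kxx) + log(mean Kyy) - 2 * log(mean Kxy); the kernel
    normalization constants cancel, so they are omitted.
    """
    kxx = mean_gaussian_kernel(xs, xs, sigma)
    kyy = mean_gaussian_kernel(ys, ys, sigma)
    kxy = mean_gaussian_kernel(xs, ys, sigma)
    return math.log(kxx) + math.log(kyy) - 2.0 * math.log(kxy)

# Toy stand-ins for audio and text embeddings (hypothetical values).
audio = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
text = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9]]

print(cs_divergence(audio, text))   # larger when the two sets are far apart
print(cs_divergence(audio, audio))  # zero for identical sets
```

The estimate is symmetric in its two arguments and non-negative by the Cauchy-Schwarz inequality, which is what makes it usable as an alignment objective: pushing it toward zero pulls the two embedding distributions together.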
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech Recognition and Synthesis
