LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence

Hyeongkeun Lee; Jongmin Choi; KiHyun Nam; Joon Son Chung

arXiv:2601.04658 · cs.SD · March 17, 2026

TL;DR

This paper introduces LAMB, an LLM-based audio captioning framework that aligns audio embeddings with the LLM text embedding space by minimizing Cauchy-Schwarz divergence, with the aim of making fuller use of the LLM's reasoning capabilities. It reports state-of-the-art results on AudioCaps.

Contribution

LAMB is the first to explicitly bridge the modality gap between audio and LLM embeddings with a divergence-based alignment and a two-stream adapter for richer audio features.

Findings

01. Achieves state-of-the-art performance on AudioCaps.

02. Effectively aligns audio and text embeddings at global and token levels.

03. Enhances reasoning capabilities of LLM-based audio captioning.

Abstract

Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally,…
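To make the divergence named in the abstract concrete, here is a minimal sketch of the textbook empirical Cauchy-Schwarz divergence between two sets of embeddings, estimated with a Gaussian kernel. It is not the paper's Cross-Modal Aligner: the function names, the kernel choice, the bandwidth `sigma`, and the toy vectors are all assumptions made for illustration, and the mutual-information term mentioned in the abstract is not modeled.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, sigma: float) -> float:
    """Gaussian (RBF) kernel between two vectors of equal length."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], sigma: float) -> float:
    """Average kernel value over all pairs (x, y) drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def cs_divergence(audio: Sequence[Vector], text: Sequence[Vector],
                  sigma: float = 1.0) -> float:
    """Empirical Cauchy-Schwarz divergence between two sample sets.

    D_CS = log E[k(a, a')] + log E[k(t, t')] - 2 log E[k(a, t)]
    It is non-negative and equals zero when the two sets coincide.
    Assumption: samples are close enough that the cross term does not
    underflow to zero (math.log(0.0) would raise ValueError).
    """
    cross = mean_kernel(audio, text, sigma)
    self_audio = mean_kernel(audio, audio, sigma)
    self_text = mean_kernel(text, text, sigma)
    return math.log(self_audio) + math.log(self_text) - 2.0 * math.log(cross)


# Hypothetical toy embeddings: identical sets versus a shifted set.
audio_emb = [[0.0, 0.0], [1.0, 0.0]]
text_emb_far = [[5.0, 5.0], [6.0, 5.0]]
aligned = cs_divergence(audio_emb, audio_emb)
misaligned = cs_divergence(audio_emb, text_emb_far)
```

In this sketch `aligned` is zero (up to floating-point error) and `misaligned` is strictly larger, which is the behavior a training loss built on this quantity would rely on: pushing the divergence down pulls the two embedding distributions together.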

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech Recognition and Synthesis