LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport

Kyeongha Rho; Hyeongkeun Lee; Valentio Iverson; Joon Son Chung

arXiv:2501.09291 · cs.MM · March 18, 2025

TL;DR

LAVCap is an LLM-based audio-visual captioning framework that uses optimal transport to align and fuse audio and visual features, improving audio captioning performance on the AudioCaps dataset.

Contribution

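The paper presents an LLM-based audio-visual captioning framework with two optimal transport components: an alignment loss that bridges the gap between audio and visual features, and an attention module that fuses the two modalities using an optimal transport assignment map. The framework outperforms existing methods.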

Findings

01. Outperforms state-of-the-art on AudioCaps dataset

02. Effective modality alignment with optimal transport loss

03. Enhanced fusion via optimal transport attention module

Abstract

Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with the optimal training strategy, experimental results demonstrate…
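The abstract names two optimal transport components without giving their formulas on this page, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes an entropy-regularized transport plan computed by Sinkhorn iterations, a cosine-distance cost, uniform marginals over tokens, and a residual fusion step; the function names, the `eps` value, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

def cosine_cost(audio, visual):
    """Pairwise cost 1 - cosine similarity between audio and visual tokens (assumed cost)."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    return 1.0 - a @ v.T

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)   # uniform mass over audio tokens (assumption)
    nu = np.full(m, 1.0 / m)   # uniform mass over visual tokens (assumption)
    K = np.exp(-cost / eps)    # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_alignment_loss(audio, visual, eps=0.1):
    """Transport cost under the Sinkhorn plan, usable as an alignment loss."""
    cost = cosine_cost(audio, visual)
    plan = sinkhorn_plan(cost, eps)
    return float(np.sum(plan * cost)), plan

def ot_fuse(audio, visual, plan):
    """Fuse modalities by using the row-normalized plan as attention weights (assumed design)."""
    weights = plan / plan.sum(axis=1, keepdims=True)  # each row sums to 1
    return audio + weights @ visual                   # residual fusion into audio tokens

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(6, 16))   # 6 audio tokens, 16-dim features (toy data)
    visual = rng.normal(size=(4, 16))  # 4 visual tokens
    loss, plan = ot_alignment_loss(audio, visual)
    fused = ot_fuse(audio, visual, plan)
    print(loss, plan.shape, fused.shape)
```

In this sketch the same plan serves both roles: its total transport cost acts as the alignment loss, and its row-normalized form replaces the usual softmax attention weights when mixing visual features into the audio tokens.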

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Subtitles and Audiovisual Media

Methods: Softmax · Attention Is All You Need