From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs

Yuhang Jia; Xu Zhang; Yujie Guo; Yang Chen; Shiwan Zhao

arXiv:2508.01659 · cs.SD · January 27, 2026

Open Access

TL;DR

This paper introduces Audio Commonality Captioning (ACC), a new task that emphasizes shared semantics across audio clips to improve audio-text understanding in multimodal language models, addressing limitations of previous difference-focused methods.

Contribution

The paper proposes ACC as a novel, more effective alternative to Audio Difference Captioning (ADC), enhancing cross-modal understanding and robustness in multimodal LLMs by focusing on commonalities rather than differences.

Findings

01 ACC improves captioning benchmark performance

02 ACC preserves general speech and music understanding

03 ACC balances task-specific and general capabilities

Abstract

Audio Captioning (AC) plays a pivotal role in enhancing audio-text cross-modal understanding during the pretraining and finetuning of Multimodal LLMs (MLLMs). To strengthen this alignment, recent works propose Audio Difference Captioning (ADC), which takes multiple audio inputs and encourages the model to describe their differences, thereby promoting fine-grained discrimination. However, despite its effectiveness, ADC introduces a semantic gap between the input audio clips, which are often rich in diverse events, and the brief, difference-focused caption. This deviation from the AC-style task causes a mismatch with the pretraining objective, leading to catastrophic forgetting. To address this, we propose Audio Commonality Captioning (ACC), a comparably challenging but gentler alternative that guides the model to capture shared semantics across audio clips rather than detailed differences. Experiments show…

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Topic Modeling