DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models

Xinlong Chen; Weihong Lin; Jingyun Hua; Linli Yao; Yue Ding; Bozhou Li; Bohan Zeng; Yang Shi; Qiang Liu; Yuanxing Zhang; Pengfei Wan; Liang Wang; Tieniu Tan

arXiv:2601.19267 · cs.CL · January 28, 2026

TL;DR

DiaDem is a novel audiovisual captioning model that significantly improves dialogue description accuracy in video captions, addressing a key challenge in multimodal understanding and generation tasks.

Contribution

The paper introduces DiaDem, an audiovisual video captioning model trained on a synthesized SFT dataset and refined with a difficulty-partitioned two-stage GRPO strategy, together with DiaDemBench, a benchmark for evaluating dialogue descriptions in audiovisual captions.

Findings

01. DiaDem outperforms the Gemini series in dialogue description accuracy.

02. Extensive experiments show that commercial models still have room for improvement.

03. DiaDem achieves competitive results on general captioning benchmarks.

Abstract

Accurate dialogue description in audiovisual video captioning is crucial for downstream understanding and generation tasks. However, existing models generally struggle to produce faithful dialogue descriptions within audiovisual captions. To mitigate this limitation, we propose DiaDem, a powerful audiovisual video captioning model capable of generating captions with more precise dialogue descriptions while maintaining strong overall performance. We first synthesize a high-quality dataset for SFT, then employ a difficulty-partitioned two-stage GRPO strategy to further enhance dialogue descriptions. To enable systematic evaluation of dialogue description capabilities, we introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions.…
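
As an illustration of the GRPO step mentioned in the abstract, the sketch below shows the group-relative advantage normalization that GRPO is generally built on, plus a purely hypothetical way to split prompts by difficulty. The abstract does not say how DiaDem defines difficulty or what its reward is, so `partition_by_difficulty`, its `threshold`, and the use of mean sampled reward as a difficulty proxy are assumptions for illustration, not the authors' method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group of sampled captions
    return [(r - mu) / (sigma + eps) for r in rewards]


def partition_by_difficulty(prompt_rewards, threshold=0.5):
    """Hypothetical split (an assumption, not from the paper).

    A prompt counts as 'easy' when the mean reward of its sampled captions
    is at least `threshold`, and as 'hard' otherwise.
    """
    easy, hard = [], []
    for prompt_id, rewards in prompt_rewards.items():
        (easy if mean(rewards) >= threshold else hard).append(prompt_id)
    return easy, hard


# Illustrative rewards for sampled captions of two hypothetical videos.
sampled = {"video_a": [1.0, 1.0], "video_b": [0.0, 0.2]}
easy_ids, hard_ids = partition_by_difficulty(sampled)
advantages = group_relative_advantages([1.0, 0.0])
```

In a two-stage setup, one plausible reading is that each partition is used in a separate training stage, but the ordering and the criteria are not stated in the abstract.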

Code & Models

Models

DiaDem-Captioner/DiaDem (model, 7 downloads, 1 like)

Datasets

DiaDem-Captioner/DiaDemBench (dataset, 16 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis