D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning

Changli Tang; Tianyi Wang; Fengyun Rao; Jing Lyu; Chao Zhang

arXiv:2602.07960·cs.CV·February 10, 2026

TL;DR

D-ORCA is a dialogue-centric, multimodal large language model designed for robust audio-visual captioning, utilizing a new bilingual dataset and novel reinforcement learning rewards to improve speaker attribution, speech recognition, and temporal grounding.

Contribution

Introduces D-ORCA, a dialogue-centric multimodal LLM, together with DVD, a large bilingual (English and Mandarin) dataset of multi-party dialogue videos, and three reinforcement learning rewards for improved audio-visual captioning.

Findings

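01 D-ORCA outperforms existing open-source models on speaker attribution, speech recognition, and temporal grounding.

02 It achieves performance comparable to larger models such as Qwen3-Omni.

03 Group relative policy optimization with three reward functions (speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment) targets fine-grained captioning accuracy.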

Abstract

Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2,000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to…
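The abstract names the ingredients (metric-derived rewards combined with group relative policy optimization) but the excerpt does not give their formulas. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation. It assumes a speech-content reward of 1 minus word error rate, a temporal reward based on interval intersection-over-union, and the standard group-relative advantage that normalizes each reward by the mean and standard deviation of its sampling group. All function names (`word_error_rate`, `content_reward`, `temporal_iou`, `group_relative_advantages`) are hypothetical.

```python
from statistics import mean, pstdev


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def content_reward(reference: str, hypothesis: str) -> float:
    """Assumed reward shape: 1 - WER, clipped to [0, 1]."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


def temporal_iou(pred: tuple, gold: tuple) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if inter > 0 and union > 0 else 0.0


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize rewards within one group of sampled captions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled transcripts for the same clip, scored against one reference.
reference = "a b c d"
samples = ["a b c d", "a x c"]
rewards = [content_reward(reference, s) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch the exact transcript receives a positive advantage and the transcript with one substitution and one deletion receives a negative one, so the policy update would favor the more accurate caption within the group. How the paper weights or combines its three rewards is not stated in the excerpt.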

Code & Models

Models

Hugging Face: tsinghua-ee/D-ORCA-8B-0210

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Subtitles and Audiovisual Media