D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
Changli Tang, Tianyi Wang, Fengyun Rao, Jing Lyu, Chao Zhang

TL;DR
D-ORCA is a dialogue-centric, multimodal large language model designed for robust audio-visual captioning, utilizing a new bilingual dataset and novel reinforcement learning rewards to improve speaker attribution, speech recognition, and temporal grounding.
Contribution
Introduces D-ORCA, a dialogue-centric multimodal LLM, together with DVD, a large bilingual dataset of multi-party dialogue videos, and three reinforcement learning rewards for improved audio-visual captioning.
Findings
D-ORCA outperforms existing open-source models on speaker attribution, speech recognition, and temporal grounding.
Achieves performance comparable to larger models such as Qwen3-Omni.
Uses reinforcement learning rewards to improve fine-grained captioning accuracy.
Abstract
Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2,000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to…
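To make the reward idea in the abstract concrete, here is a minimal sketch of how a speech-processing metric could be turned into a reward and then normalized within a group of sampled captions, as group relative policy optimization does. The excerpt does not say which metrics the authors use or how the three rewards are combined, so everything below is an assumption for illustration: word error rate (WER) stands in for "global speech content accuracy", and the names `word_error_rate`, `speech_content_reward`, and `group_relative_advantages` are hypothetical, not taken from the paper.

```python
from statistics import mean, pstdev


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def speech_content_reward(reference: str, hypothesis: str) -> float:
    """Assumed mapping from WER to a reward in [0, 1]; higher is better."""
    return 1.0 - min(1.0, word_error_rate(reference, hypothesis))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same input.

    Uses the population standard deviation; this choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: three sampled transcripts scored against one reference.
reference = "where are you going tonight"
samples = [
    "where are you going tonight",
    "where are you going",
    "who is going",
]
rewards = [speech_content_reward(reference, s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Captions with above-average reward in their group receive a positive advantage and the rest a negative one, so no separate value model is needed. Speaker attribution and temporal boundary rewards would need their own metrics, which this sketch does not attempt to guess.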
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Subtitles and Audiovisual Media
