TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Linli Yao; Yuancheng Wei; Yaojie Zhang; Lei Li; Xinlong Chen; Feifan Song; Ziyue Wang; Kun Ouyang; Yuanxin Liu; Lingpeng Kong; Qi Liu; Pengfei Wan; Kun Gai; Yuanxing Zhang; Xu Sun

arXiv:2602.08711 · cs.CV · February 13, 2026

TL;DR

This paper introduces Omni Dense Captioning, a new task for generating detailed, structured, and time-aware audio-visual narratives with a comprehensive benchmark and a strong baseline model, advancing video captioning and reasoning capabilities.

Contribution

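The paper proposes a new task, Omni Dense Captioning, together with a six-dimensional structural schema, a human-annotated benchmark (OmniDCBench), an evaluation metric (SodaM), a training dataset (TimeChatCap-42K), and a baseline model (TimeChat-Captioner-7B) trained for dense, time-aware audio-visual captioning.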

Findings

01. The baseline model achieves state-of-the-art performance.
02. Dense descriptions improve audio-visual reasoning.
03. Structured captions enhance temporal grounding.

Abstract

This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing…
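To make the idea of time-stamped, "script-like" captions concrete, here is a minimal sketch of how one scene-level caption with explicit timestamps might be represented and rendered. Everything in it is an assumption for illustration: the `SceneCaption` class, the field names `visual` and `audio`, and the helper functions are hypothetical and do not reproduce the paper's six-dimensional schema. The `temporal_iou` helper is the standard interval overlap measure often used when comparing scene boundaries; it is not SodaM.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneCaption:
    """One scene with explicit start/end times (seconds) and named description fields.

    The keys of `fields` are placeholders, not the paper's schema dimensions.
    """
    start: float
    end: float
    fields: dict[str, str] = field(default_factory=dict)


def temporal_iou(a: SceneCaption, b: SceneCaption) -> float:
    """Intersection-over-union of two time intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, truncating any fractional part."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_script(scenes: list[SceneCaption]) -> str:
    """Render scenes in chronological order as a plain-text, screenplay-like script."""
    lines = []
    for scene in sorted(scenes, key=lambda s: s.start):
        lines.append(f"[{format_timestamp(scene.start)} - {format_timestamp(scene.end)}]")
        for name, text in scene.fields.items():
            lines.append(f"  {name}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example data, invented purely for illustration.
    scenes = [
        SceneCaption(12.0, 20.5, {"visual": "A door opens.", "audio": "A creak."}),
        SceneCaption(0.0, 12.0, {"visual": "An empty hallway."}),
    ]
    print(render_script(scenes))
    print(temporal_iou(SceneCaption(0.0, 10.0), SceneCaption(5.0, 15.0)))
```

Keeping timestamps as plain floats and the descriptive fields as a dictionary keeps the sketch independent of any particular schema, so the same structure could hold whichever dimensions a given annotation scheme defines.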

Code & Models

Models

Hugging Face: yaolily/TimeChat-Captioner-GRPO-7B (model, 131 downloads, 2 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Subtitles and Audiovisual Media