ViDiC: Video Difference Captioning

Jiangtao Wu; Shihao Li; Zhaozhou Bian; Jialu Chen; Runzhe Wen; An Ping; Yiwen He; Jiakai Wang; Yuanxing Zhang; Jiaheng Liu

arXiv:2512.03405 · cs.CV · March 25, 2026

TL;DR

ViDiC introduces a new task and dataset for evaluating how well multimodal models can describe and compare dynamic video scenes, emphasizing motion, event evolution, and editing consistency.

Contribution

The paper presents the ViDiC task and ViDiC-1K dataset, along with a dual-checklist evaluation framework, to advance video difference captioning and comparative reasoning in multimodal models.

Findings

01 Significant performance gap in existing models' ability to describe video differences

02 ViDiC-1K provides a challenging benchmark for video understanding

03 Dual-checklist framework effectively measures similarity and difference accuracy
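
As an illustration of how a dual-checklist score of this kind could be computed, here is a minimal Python sketch. It assumes each checklist item is a yes/no question labeled as either a similarity item or a difference item, with a ground-truth answer and an answer that a judge derived from the model's caption. The `ChecklistItem` fields, the `dual_checklist_accuracy` function, and the sample items are all hypothetical and are not taken from the paper, whose actual scoring protocol may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Hypothetical labels for the two checklists; the paper's own naming may differ.
KINDS = ("similarity", "difference")


@dataclass(frozen=True)
class ChecklistItem:
    kind: str        # "similarity" or "difference"
    category: str    # e.g. "subject", "style", "motion"
    expected: bool   # ground-truth yes/no answer for this item
    predicted: bool  # answer a judge derived from the model's caption


def dual_checklist_accuracy(
    items: Iterable[ChecklistItem],
) -> Dict[str, Optional[float]]:
    """Return accuracy per checklist kind, or None if a kind has no items."""
    totals = {kind: 0 for kind in KINDS}
    correct = {kind: 0 for kind in KINDS}
    for item in items:
        if item.kind not in totals:
            raise ValueError(f"unknown checklist kind: {item.kind!r}")
        totals[item.kind] += 1
        if item.predicted == item.expected:
            correct[item.kind] += 1
    return {
        kind: (correct[kind] / totals[kind] if totals[kind] else None)
        for kind in KINDS
    }


# Made-up items for one video pair, purely for illustration.
sample_items = [
    ChecklistItem("similarity", "background", expected=True, predicted=True),
    ChecklistItem("similarity", "style", expected=True, predicted=False),
    ChecklistItem("difference", "motion", expected=True, predicted=True),
    ChecklistItem("difference", "subject", expected=False, predicted=False),
    ChecklistItem("difference", "cinematography", expected=True, predicted=False),
]

scores = dual_checklist_accuracy(sample_items)
print(scores)
```

Reporting the two accuracies separately, rather than as one pooled number, keeps a model that describes shared content well but misses the changes (or the reverse) from hiding behind an average.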

Abstract

Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes, a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location,…

Code & Models

Datasets

NJU-LINK/ViDiC-1K
dataset · 1.8k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis