Vidi2.5: Large Multimodal Models for Video Understanding and Creation

Vidi Team; Chia-Wen Kuo; Chuang Huang; Dawei Du; Fan Chen; Fanding Lei; Feng Gao; Guang Chen; Haoji Zhang; Haojun Zhao; Jin Liu; Jingjing Zhuge; Lili Fang; Lingxi Zhang; Longyin Wen; Lu Guo; Lu Xu; Lusha Li; Qihang Fan; Rachel Deng; Shaobo Fang; Shu Zhang; Sijie Zhu; Stuart Siew; Weiyan Tao; Wen Zhong; Xiaohui Shen; Xin Gu; Ye Yuan; Yicheng He; Yiming Cui; Zhenfang Chen; Zhihua Wu; Zuhua Lin

arXiv:2511.19529·cs.CV·January 21, 2026

TL;DR

Vidi2.5 is a multimodal model for video understanding and creation that advances fine-grained spatio-temporal grounding, video question answering, and plot reasoning. It is reported to outperform proprietary systems such as Gemini 3 Pro and GPT-5, and it is accompanied by new evaluation benchmarks.

Contribution

The paper introduces Vidi2.5, a new release of the Vidi models with enhanced video understanding capabilities, including a plot reasoning model and new benchmarks for comprehensive evaluation.

Findings

01

Vidi2.5 outperforms proprietary systems like Gemini 3 Pro and GPT-5.

02

The paper introduces the VUE-STG and VUE-PLOT benchmarks for detailed evaluation.

03

Vidi2.5-Think excels in character understanding and plot reasoning.

Abstract

Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. To enable comprehensive evaluation of STG, we introduce a new benchmark, VUE-STG, which offers critical improvements over existing STG datasets. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques