Vidi2.5: Large Multimodal Models for Video Understanding and Creation
Vidi Team, Chia-Wen Kuo, Chuang Huang, Dawei Du, Fan Chen, Fanding Lei, Feng Gao, Guang Chen, Haoji Zhang, Haojun Zhao, Jin Liu, Jingjing Zhuge, Lili Fang, Lingxi Zhang, Longyin Wen, Lu Guo, Lu Xu, Lusha Li, Qihang Fan, Rachel Deng, Shaobo Fang, Shu Zhang, Sijie Zhu, Stuart Siew

TL;DR
Vidi2.5 is a multimodal model for video understanding and creation that advances fine-grained spatio-temporal grounding, video question answering, and plot reasoning. It is reported to outperform proprietary systems such as Gemini 3 Pro and GPT-5, and it is released alongside new evaluation benchmarks.
Contribution
The paper introduces Vidi2.5, a new release of the Vidi models with enhanced video understanding capabilities, including a novel plot reasoning model and new benchmarks for comprehensive evaluation.
Findings
Vidi2.5 outperforms proprietary systems like Gemini 3 Pro and GPT-5.
The paper introduces the VUE-STG and VUE-PLOT benchmarks for detailed evaluation.
Vidi2.5-Think excels in character understanding and plot reasoning.
Abstract
Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. To enable comprehensive evaluation of STG, we introduce a new benchmark, VUE-STG, which offers critical improvements over existing STG datasets. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2,…
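As an illustration of the spatio-temporal grounding output described in the abstract (a time range plus bounding boxes for the queried object), the following is a minimal Python sketch of how such predictions could be scored by overlap. The `Box` type, the `(start, end)` tuple layout in seconds, and both IoU functions are assumptions made for this example; the abstract does not state the output format or the metrics that VUE-STG actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time ranges in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """Intersection-over-union of two boxes within a single frame."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Illustrative values only: a predicted range vs. an annotated range,
# and a predicted box vs. an annotated box in one frame.
t_score = temporal_iou((0.0, 10.0), (5.0, 15.0))
b_score = box_iou(Box(0, 0, 2, 2), Box(1, 1, 3, 3))
```

A combined spatio-temporal score would typically aggregate per-frame box overlap across the overlapping time range, but how VUE-STG does this is not specified in the text above.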
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
