Spotlight: Identifying and Localizing Video Generation Errors Using VLMs

Aditya Chinchure; Sahithya Ravi; Pushkar Shukla; Vered Shwartz; Leonid Sigal

arXiv:2511.18102 · cs.CV · November 25, 2025

TL;DR

This paper introduces Spotlight, a task for localizing and explaining errors in text-to-video generation. It shows that current VLMs fall short of humans at detecting these errors and proposes inference strategies that improve their performance.

Contribution

The paper presents a novel task and dataset for error localization in video generation, an evaluation of current VLMs on that task, and inference strategies that improve their performance.

Findings

01

Current VLMs lag behind humans in error detection.

02

Adherence and physics errors are most common and persistent.

03

Inference strategies can nearly double VLM performance.

Abstract

Current text-to-video (T2V) models can generate high-quality, temporally coherent, and visually realistic videos. Nonetheless, errors still occur often, and they are more nuanced and local than those of the previous generation of T2V models. While current evaluation paradigms assess video models across diverse dimensions, they typically evaluate videos holistically without identifying when specific errors occur or describing their nature. We address this gap by introducing Spotlight, a novel task aimed at localizing and explaining video-generation errors. We generate 600 videos using 200 diverse textual prompts and three state-of-the-art video generators (Veo 3, Seedance, and LTX-2), and annotate over 1600 fine-grained errors across six types, including motion, physics, and prompt adherence. We observe that adherence and physics errors are predominant and persist across longer segments,…
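To make the idea of a localized, explained error concrete, here is a minimal Python sketch of how one such annotation might be represented and compared against a model's prediction. Everything in it is an assumption for illustration: the `ErrorSpan` schema, the field names, the use of temporal intersection-over-union (IoU) as a matching criterion, and the 0.5 threshold are not taken from the paper, which may define its annotation format and metrics differently. Only the error type names "motion", "physics", and "adherence" come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    """One localized error in a generated video (hypothetical schema)."""
    start_s: float          # segment start, in seconds
    end_s: float            # segment end, in seconds
    error_type: str         # e.g. "motion", "physics", "adherence"
    explanation: str = ""   # free-text description of what went wrong


def temporal_iou(a: ErrorSpan, b: ErrorSpan) -> float:
    """Intersection-over-union of two time segments (0.0 if disjoint)."""
    intersection = max(0.0, min(a.end_s, b.end_s) - max(a.start_s, b.start_s))
    union = (a.end_s - a.start_s) + (b.end_s - b.start_s) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred: ErrorSpan, gold: ErrorSpan, iou_threshold: float = 0.5) -> bool:
    """Assumed criterion: same error type and sufficient temporal overlap."""
    return (pred.error_type == gold.error_type
            and temporal_iou(pred, gold) >= iou_threshold)


# Illustrative values only, not data from the paper.
gold = ErrorSpan(2.0, 6.0, "physics", "ball passes through the table")
pred = ErrorSpan(4.0, 8.0, "physics", "object clips through a surface")

overlap = temporal_iou(pred, gold)   # 2 s shared out of a 6 s union
matched = is_match(pred, gold)       # overlap is below the 0.5 threshold
```

Under this assumed criterion, a prediction that names the right error type but covers only part of the annotated segment would not count as a match, which is one way a benchmark could separate identifying an error from localizing it.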

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)