EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

TL;DR
EntityBench is a benchmark of 140 episodes (2,491 shots) for evaluating long-range multi-shot video generation, with a focus on keeping characters, objects, and locations consistent across shots.
Contribution
The paper introduces EntityBench, a new dataset and evaluation suite, along with EntityMem, a memory-augmented generation system for improved entity consistency.
Findings
Cross-shot entity consistency decreases with recurrence distance in existing methods.
Explicit per-entity memory significantly improves character fidelity and presence.
EntityMem outperforms baseline methods in maintaining entity consistency.
Abstract
Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency,…
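To make "per-shot entity schedule" and "recurrence gap" concrete, here is a minimal Python sketch. The schedule format, the entity IDs, and the function name `recurrence_gaps` are illustrative assumptions, not the benchmark's actual data format or API. The gap is assumed to be the difference in shot index between consecutive appearances of the same entity; the paper may define it differently (for example, as the number of intervening shots).

```python
from collections import defaultdict

# Hypothetical schedule: one set of entity IDs per shot, in shot order.
# The "char:", "loc:", "obj:" prefixes are an illustrative convention only.
schedule = [
    {"char:alice", "loc:kitchen"},
    {"char:bob", "loc:street"},
    {"char:alice", "obj:red_mug", "loc:kitchen"},
    {"char:bob"},
    {"char:alice", "obj:red_mug", "loc:street"},
]

def recurrence_gaps(schedule):
    """Map each recurring entity to the gaps between its consecutive appearances."""
    last_seen = {}            # entity -> index of the most recent shot containing it
    gaps = defaultdict(list)  # entity -> list of shot-index differences
    for shot_idx, entities in enumerate(schedule):
        for entity in entities:
            if entity in last_seen:
                gaps[entity].append(shot_idx - last_seen[entity])
            last_seen[entity] = shot_idx
    return dict(gaps)

gaps = recurrence_gaps(schedule)
cross_shot = sorted(gaps)  # entities that appear in more than one shot
max_gap = max(g for gs in gaps.values() for g in gs)

print(cross_shot)
print(max_gap)
```

Under this assumed definition, an episode's difficulty could be summarized by how many entities recur and by the largest gap a model must bridge before an entity reappears.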
