Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
Zhijiang Tang, Jiaxin Qi, Bing Zhao, Jianqiang Huang

TL;DR
This paper introduces Long-CODE, a new framework and benchmark for evaluating long-video generation quality, focusing on long-range attributes like narrative consistency, which existing short-video metrics overlook.
Contribution
The paper proposes a novel long-video evaluation metric based on shot dynamics and introduces a dedicated dataset with human annotations for long-range video assessment.
Findings
The new metric correlates highly with human judgments.
Existing short-video metrics are insensitive to structural long-range inconsistencies.
Long-CODE provides a comprehensive benchmark for long-video evaluation.
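As a rough illustration of what "correlates highly with human judgments" usually means in metric evaluation, the sketch below computes a Spearman rank correlation between metric scores and human ratings. The summary here does not say which correlation statistic the paper reports, so the choice of Spearman is an assumption, and the function names and the toy scores in the usage lines are hypothetical, not values from the paper.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(metric_scores), ranks(human_scores))


# Hypothetical toy data: one metric score and one human rating per video.
toy_metric = [0.10, 0.40, 0.35, 0.80]
toy_human = [1.0, 3.0, 2.0, 5.0]
rho = spearman(toy_metric, toy_human)
```

A value of `rho` near 1 means the metric orders the videos the same way human raters do; a value near 0 means the metric's ordering is unrelated to theirs.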
Abstract
As video generation models achieve unprecedented capabilities, the demand for robust video evaluation metrics becomes increasingly critical. Traditional metrics are intrinsically tailored for short-video evaluation, predominantly assessing frame-level visual quality and localized temporal smoothness. However, as state-of-the-art video generation models scale to generate longer videos, these metrics fail to capture essential long-range characteristics, such as narrative richness and global causal consistency. Recognizing that short-term visual perception and long-context attributes are fundamentally orthogonal dimensions, we argue that long-video metrics should be disentangled from short-video assessments. In this paper, we focus on the rigorous justification and design of a dedicated framework for long-video evaluation. We first introduce a suite of long-video attribute corruption…
