OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen

TL;DR
This paper introduces OSCBench, a benchmark for evaluating how well text-to-video (T2V) models render object state changes, and finds that current models struggle to produce accurate, temporally consistent object transformations.
Contribution
The paper presents OSCBench, a benchmark for assessing object state change (OSC) in T2V models. It targets an aspect that existing benchmarks, which focus on perceptual quality, text-video alignment, or physical plausibility, leave largely unexplored.
Findings
Current T2V models excel at scene and semantic alignment.
Models struggle with accurate, temporally consistent object state changes.
Performance drops in novel and compositional scenarios.
Abstract
Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary…
