OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han; Bin Zhu; Shiqi Hu; Franklin Mingzhe Li; Patrick Carrington; Roger Zimmermann; Jingjing Chen

arXiv:2603.11698·cs.CV·April 20, 2026


TL;DR

This paper introduces OSCBench, a benchmark for evaluating object state change understanding in text-to-video models, revealing current models' struggles with accurate, consistent object transformations.

Contribution

The paper presents OSCBench, a novel benchmark for assessing object state change in T2V models, highlighting a critical gap in current evaluation methods.

Findings

01 Current T2V models excel at scene and semantic alignment.

02 Models struggle with accurate, temporally consistent object state changes.

03 Performance drops in novel and compositional scenarios.

Abstract

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary…


Code & Models

Datasets

XianjingHan/OSCBench_Dataset
dataset· 32 dl
