GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos

Minghan Li; Tongna Chen; Tianrui Lv; Yishuai Zhang; Suchao An; Guodong Zhou

arXiv:2603.14426 · cs.CV · March 17, 2026

TL;DR

GenState-AI is an AI-generated benchmark for text-to-video retrieval built around controlled state transitions and end-state grounding. It shows that current models struggle to separate the correct clip from hard negatives, particularly negatives that differ only in the end-state.

Contribution

The paper presents GenState-AI, a new dataset and evaluation framework focused on state-aware retrieval with fine-grained temporal and semantic distinctions, and provides diagnostic analyses of model failures.

Findings

01 Models often confuse the temporal hard negative with the main (correct) video.

02 Models tend to over-prefer end-states that are temporally plausible but incorrect.

03 Semantic substitutions contribute less to model confusion than temporal differences do.

Abstract

Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns:…
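The triplet design described above (one main video plus a temporal and a semantic hard negative per query) lends itself to a simple diagnostic: for each query, check which of the three candidates a model scores highest, then tally the outcomes. The following is a minimal sketch under stated assumptions, not the authors' evaluation code. The names `TripletScores`, `top_choice`, and `confusion_rates` are hypothetical, and the similarity scores are assumed to come from whatever retrieval model is being tested.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Sequence

# Candidate names in tie-breaking order (hypothetical labels, not from the paper).
CANDIDATES = ("main", "temporal_neg", "semantic_neg")


@dataclass(frozen=True)
class TripletScores:
    """Query-to-video similarity scores for one query's three candidates."""
    main: float          # the correct video
    temporal_neg: float  # differs only in the decisive end-state
    semantic_neg: float  # content substituted


def top_choice(scores: TripletScores) -> str:
    """Return the name of the highest-scoring candidate.

    Ties go to the candidate listed first in CANDIDATES.
    """
    return max(CANDIDATES, key=lambda name: getattr(scores, name))


def confusion_rates(all_scores: Sequence[TripletScores]) -> Dict[str, float]:
    """Fraction of queries for which each candidate ranks first."""
    if not all_scores:
        raise ValueError("need at least one scored query")
    counts = Counter(top_choice(s) for s in all_scores)
    return {name: counts[name] / len(all_scores) for name in CANDIDATES}
```

Under this sketch, the `main` rate is top-1 accuracy within the triplet, while the `temporal_neg` and `semantic_neg` rates separate the two kinds of error, which is the temporal vs. semantic split the benchmark is designed to expose.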

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis