CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Yanqing Liu; Yingcheng Liu; Fanghong Dong; Budianto Budianto; Cihang Xie; Yan Jiao

arXiv:2603.08648·cs.CV·March 10, 2026

Open Access

TL;DR

This paper introduces CAST, a lightweight adapter that models visual state transitions to make long-form video retrieval and generation more consistent, addressing the limitations of context-agnostic retrieval methods.

Contribution

We formalize the task of Consistent Video Retrieval (CVR) and propose CAST, a plug-and-play adapter that adds explicit latent state modeling on top of frozen vision-language embeddings to improve video retrieval.

Findings

01. CAST improves retrieval performance on the YouCook2 and CrossTask datasets.

02. CAST outperforms zero-shot baselines across various backbone models.

03. CAST provides effective reranking signals for black-box video generation.

Abstract

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive…
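The residual-update idea in the abstract can be illustrated with a toy sketch. This is not the authors' architecture, which this page does not describe. It assumes, purely for illustration, that the residual $\Delta$ is predicted by a small two-layer network from the query embedding and a mean-pooled visual history, is added to the frozen query embedding, and that candidates are then ranked by cosine similarity. The names `ResidualAdapter`, `contextual_query`, and `rank_candidates` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class ResidualAdapter:
    """Hypothetical adapter: predicts a residual delta from visual history.

    Assumption: a two-layer MLP over [query, pooled history]. The real CAST
    design may differ; weights here are random, not trained.
    """

    def __init__(self, dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(2 * dim, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, dim))

    def delta(self, query_emb, history_embs):
        # Mean pooling is a crude stand-in for a summary of the visual state.
        state = history_embs.mean(axis=0)
        h = np.tanh(np.concatenate([query_emb, state]) @ self.w1)
        return h @ self.w2

    def contextual_query(self, query_emb, history_embs):
        # Residual update: the frozen embedding is shifted, not replaced.
        return l2_normalize(query_emb + self.delta(query_emb, history_embs))


def rank_candidates(query, candidates):
    """Return candidate indices sorted by descending cosine similarity."""
    scores = l2_normalize(candidates) @ query
    return np.argsort(-scores), scores


# Toy data standing in for frozen vision-language embeddings.
rng = np.random.default_rng(1)
dim = 8
query = l2_normalize(rng.normal(size=dim))
history = l2_normalize(rng.normal(size=(3, dim)))
candidates = l2_normalize(rng.normal(size=(5, dim)))

adapter = ResidualAdapter(dim)
ctx_query = adapter.contextual_query(query, history)
order, scores = rank_candidates(ctx_query, candidates)
print(order)
```

Because the update is residual, a zero $\Delta$ leaves the query at its original frozen embedding, so in this sketch the adapter can only shift the backbone's ranking rather than replace it.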

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques