CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao

TL;DR
This paper introduces CAST, a novel adapter that models visual state transitions to improve the consistency of long-form video retrieval and generation, addressing the limitations of context-agnostic retrieval methods.
Contribution
We formalize the task of Consistent Video Retrieval and propose CAST, a plug-and-play adapter that enhances latent state modeling in vision-language embeddings for better video retrieval.
Findings
CAST improves retrieval performance on YouCook2 and CrossTask datasets.
CAST outperforms zero-shot baselines across various backbone models.
CAST provides effective reranking signals for black-box video generation.
Abstract
As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive…
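To make the idea of a residual update over a frozen embedding space concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `ResidualStateAdapter`, the mean-pooled history, the two-layer form, and the zero initialization are all assumptions chosen for illustration, and the embeddings are random stand-ins for the outputs of a frozen vision-language backbone.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class ResidualStateAdapter:
    """Hypothetical adapter (an assumption, not the paper's architecture).

    Computes query' = normalize(query + delta), where delta is predicted
    from the query and a mean-pooled summary of the visual history.
    """

    def __init__(self, dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.02, size=(2 * dim, hidden))
        # Zero-initialized output layer: the untrained adapter is an identity
        # map, so it cannot degrade the frozen backbone before training.
        self.w2 = np.zeros((hidden, dim))

    def residual(self, query, history):
        state = history.mean(axis=0)  # crude summary of past clips
        hidden = np.tanh(np.concatenate([query, state]) @ self.w1)
        return hidden @ self.w2

    def __call__(self, query, history):
        return l2_normalize(query + self.residual(query, history))


def retrieve(query, candidates, k=1):
    """Rank candidate clip embeddings by cosine similarity to the query."""
    scores = l2_normalize(candidates) @ l2_normalize(query)
    return np.argsort(-scores)[:k], scores


# Random stand-ins for frozen backbone embeddings (illustrative only).
rng = np.random.default_rng(1)
dim = 16
candidates = l2_normalize(rng.normal(size=(5, dim)))
history = l2_normalize(rng.normal(size=(3, dim)))
query = candidates[2].copy()

adapter = ResidualStateAdapter(dim)
adapted = adapter(query, history)
top, scores = retrieve(adapted, candidates, k=1)
```

Because the output layer starts at zero, the untrained adapter leaves the query unchanged and retrieval falls back to the context-agnostic ranking; training would then move the query toward clips consistent with the history. The same scores could in principle be used to rerank candidate clips from a generator, though how the paper does so is not specified in this summary.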
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
