Taming generative video models for zero-shot optical flow extraction

Seungwoo Kim; Khai Loong Aw; Klemen Kotar; Cristobal Eyzaguirre; Wanhee Lee; Yunong Liu; Jared Watrous; Stefan Stojanov; Juan Carlos Niebles; Jiajun Wu; Daniel L. K. Yamins

arXiv:2507.09082 · cs.CV · December 1, 2025

TL;DR

This paper demonstrates that large, self-supervised, generative video models with specific architectural properties can be prompted to extract optical flow in a zero-shot manner, eliminating the need for fine-tuning.

Contribution

The authors introduce KL-tracing, a test-time inference method that uses the Local Random Access Sequence (LRAS) architecture to extract optical flow zero-shot from generative video models.

Findings

01 Competitive performance on TAP-Vid benchmarks without fine-tuning

02 Successful zero-shot flow extraction using properties of LRAS models

03 KL-tracing outperforms some task-specific methods in certain scenarios

Abstract

Extracting optical flow from videos remains a core computer vision problem. Motivated by the recent success of large general-purpose models, we ask whether frozen self-supervised video models trained only to predict future frames can be prompted, without fine-tuning, to output flow. Prior attempts to read out depth or illumination from video generators required fine-tuning; that strategy is ill-suited for flow, where labeled data is scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models for zero-shot flow extraction. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by…
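The tracer idea described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: `toy_predictor` is a hypothetical stand-in for a frozen next-frame model (it simply translates a small token grid by a hidden shift), frames are grids of discrete tokens, and the response to the perturbation is scored with a per-location KL divergence, as the name KL-tracing suggests. The flow at a query point is then the offset from that point to the location where the perturbed and unperturbed predictions differ most.

```python
import math

K = 8  # toy vocabulary size (hypothetical)

def toy_predictor(frame, shift=(1, 2), eps=0.05):
    """Hypothetical stand-in for a frozen next-frame model.

    Returns, for every location, a categorical distribution over K tokens
    for the next frame. The hidden scene motion is a wrap-around
    translation of the whole grid by `shift` = (dy, dx).
    """
    h, w = len(frame), len(frame[0])
    dy, dx = shift
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            tok = frame[(y - dy) % h][(x - dx) % w]  # token that moves here
            probs = [eps / (K - 1)] * K              # smoothing mass
            probs[tok] = 1.0 - eps                   # peak on the moved token
            out[y][x] = probs
    return out

def kl(p, q):
    """KL(p || q) for two categorical distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def trace_flow(frame, point, predictor, tracer_token=K - 1):
    """Estimate the flow at `point` by injecting a tracer and tracking it."""
    y0, x0 = point
    clean = predictor(frame)

    # Inject a small, localized perturbation into the first frame.
    perturbed_frame = [row[:] for row in frame]
    perturbed_frame[y0][x0] = tracer_token
    perturbed = predictor(perturbed_frame)

    # Find where the prediction changed most (assumed KL-based response).
    best, best_kl = None, -1.0
    for y in range(len(frame)):
        for x in range(len(frame[0])):
            d = kl(perturbed[y][x], clean[y][x])
            if d > best_kl:
                best, best_kl = (y, x), d

    y1, x1 = best
    return (y1 - y0, x1 - x0)  # displacement (dy, dx) of the tracer

# Tokens 0..K-2 only, so the tracer token K-1 is always a visible change.
frame = [[(3 * y + x) % (K - 1) for x in range(6)] for y in range(5)]
print(trace_flow(frame, (2, 1), toy_predictor))  # (1, 2)
```

In this toy setting the recovered displacement equals the hidden shift because the perturbation changes the prediction at exactly one location. With a real generative model the response would be spread over a neighborhood and would depend on how the model is conditioned, which this sketch does not attempt to capture.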

Videos

Taming generative video models for zero-shot optical flow extraction (SlidesLive)