TL;DR
This paper shows that large, frozen, self-supervised generative video models with suitable architectural properties can be prompted to output optical flow zero-shot, with no fine-tuning.
Contribution
The authors introduce KL-tracing, a test-time inference method that extracts optical flow zero-shot from a generative video model, instantiated here on the Local Random Access Sequence (LRAS) architecture.
Findings
Competitive performance on TAP-Vid benchmarks without fine-tuning
Successful zero-shot flow extraction using properties of LRAS models
KL-tracing outperforms some task-specific methods in certain scenarios
Abstract
Extracting optical flow from videos remains a core computer vision problem. Motivated by the recent success of large general-purpose models, we ask whether frozen self-supervised video models trained only to predict future frames can be prompted, without fine-tuning, to output flow. Prior attempts to read out depth or illumination from video generators required fine-tuning; that strategy is ill-suited for flow, where labeled data is scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models for zero-shot flow extraction. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by…
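The tracer idea in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the summary above does not confirm: the predictor outputs per-location logits over a discrete token vocabulary, the perturbation is read out as the location with the largest KL divergence between the perturbed and clean predictions, and the KL direction is KL(perturbed || clean). `toy_predictor` is a hypothetical stand-in that only translates the frame; it is not the LRAS model or the authors' implementation.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def kl_map(clean_logits, perturbed_logits):
    # Per-location KL(perturbed || clean); the direction is an assumption.
    log_p = log_softmax(perturbed_logits)
    log_q = log_softmax(clean_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

def trace_flow(clean_logits, perturbed_logits, source_yx):
    # Read the tracer's destination as the peak of the KL map (assumed readout).
    kl = kl_map(clean_logits, perturbed_logits)
    target_yx = np.unravel_index(np.argmax(kl), kl.shape)
    return np.array(target_yx) - np.array(source_yx), kl

def toy_predictor(frame, shift, vocab=8, sharpness=4.0):
    # Hypothetical frozen "next-frame model": translates the token grid by
    # `shift` (with wraparound) and returns peaked logits of shape (H, W, vocab).
    moved = np.roll(frame, shift=shift, axis=(0, 1))
    return sharpness * np.eye(vocab)[moved]

rng = np.random.default_rng(0)
frame = rng.integers(0, 8, size=(16, 16))   # toy token grid
source = (5, 6)                             # query point (y, x)

perturbed = frame.copy()
perturbed[source] = (perturbed[source] + 1) % 8   # small tracer perturbation

clean_logits = toy_predictor(frame, shift=(2, -3))
perturbed_logits = toy_predictor(perturbed, shift=(2, -3))

flow, kl = trace_flow(clean_logits, perturbed_logits, source)
print(flow)  # recovers the toy shift (dy=2, dx=-3)
```

In this toy setting the two predictions differ at exactly one location, so the KL map has a single nonzero entry. With a real generative model the map would be diffuse, and the readout and perturbation design would matter far more.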