Future Optical Flow Prediction Improves Robot Control & Video Generation

Kanchana Ranasinghe; Honglu Zhou; Yu Fang; Luyu Yang; Le Xue; Ran Xu; Caiming Xiong; Silvio Savarese; Michael S Ryoo; Juan Carlos Niebles

arXiv:2601.10781 · cs.CV · January 19, 2026

TL;DR

FOFPred is a language-conditioned optical flow forecasting model that combines a Vision-Language Model (VLM) with a diffusion architecture. It is trained on web-scale human activity data and applied to robot control and video generation.

Contribution

The paper introduces a unified VLM-Diffusion architecture for future optical flow prediction. Trained on noisy web video-caption data, it pairs multimodal reasoning with pixel-level generative fidelity.

Findings

01 Effective in robotic manipulation tasks

02 Enhances video generation quality

03 Demonstrates cross-domain versatility

Abstract

Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to…

Code & Models

Models

Salesforce/FOFPred (Hugging Face model · 55 downloads · 3 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis