Flow and Depth Assisted Video Prediction with Latent Transformer

Eliyas Suleyman; Paul Henderson; Eksan Firkat; Nicolas Pugeault

arXiv:2511.16484 · cs.CV · November 21, 2025

TL;DR

This paper introduces a novel approach for occluded video prediction by integrating depth and point-flow information into a latent transformer model, significantly improving prediction accuracy in occlusion scenarios.

Contribution

The study systematically explores the use of depth and point-flow data to enhance video prediction models, addressing occlusion challenges in both synthetic and real-world datasets.

Findings

01

Improved prediction accuracy with depth and point-flow assistance.

02

Better modeling of background motion in occluded scenarios.

03

Enhanced performance measured by appearance metrics and Wasserstein distances.
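
As a rough illustration of the second kind of metric, here is a minimal sketch of a one-dimensional Wasserstein-1 distance between two equal-sized sets of samples. This page does not say which quantities the paper compares or how it computes the distance, so the function name `wasserstein_1d` and the idea of feeding it per-frame motion magnitudes are assumptions made purely for illustration.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized 1D empirical samples.

    For uniform weights and equal sample counts, the optimal transport plan
    matches the i-th smallest value of xs to the i-th smallest value of ys,
    so the distance is the mean absolute difference of the sorted samples.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical example values: motion magnitudes from predicted vs. real frames.
predicted_motion = [0.0, 1.0, 2.0]
ground_truth_motion = [1.0, 2.0, 3.0]
distance = wasserstein_1d(predicted_motion, ground_truth_motion)
```

Unlike a per-pixel appearance score, a distance of this kind compares whole distributions, so it can stay low when a prediction has the right overall motion statistics even if individual pixels differ.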

Abstract

Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion remains an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify it to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but…

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging