Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Efstathios Karypidis; Spyros Gidaris; Nikos Komodakis

arXiv:2604.11707 · cs.CV · April 14, 2026

TL;DR

Re2Pix is a hierarchical video prediction framework that forecasts scene semantics in feature space and then synthesizes photorealistic frames, improving temporal consistency and quality in complex environments.

Contribution

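The paper introduces Re2Pix, a two-stage approach that first predicts future semantic representations and then synthesizes frames conditioned on them, together with strategies for handling the mismatch between the ground-truth representations seen during training and the predicted ones used at inference.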

Findings

01  Significantly improves temporal semantic consistency in video prediction.

02  Enhances perceptual quality of generated videos.

03  Increases training efficiency over baseline diffusion models.

Abstract

Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address…
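
The two-stage data flow described in the abstract can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: `frozen_encoder`, `predict_features`, `render_frame`, and `predict_video` are hypothetical names, and each one is a trivial stand-in (average pooling, linear extrapolation, nearest-neighbour upsampling) for the frozen vision foundation model, the semantic predictor, and the latent diffusion renderer respectively. None of it reflects the authors' actual models or code.

```python
from typing import List

Frame = List[float]    # toy "frame": a flat list of pixel intensities
Feature = List[float]  # toy "semantic feature" vector


def frozen_encoder(frame: Frame) -> Feature:
    """Hypothetical stand-in for a frozen foundation model: 2-pixel average pooling."""
    return [(frame[i] + frame[i + 1]) / 2.0 for i in range(0, len(frame) - 1, 2)]


def predict_features(past: List[Feature], horizon: int) -> List[Feature]:
    """Stage 1 stand-in: autoregressive linear extrapolation in feature space.

    Needs at least two past feature vectors.
    """
    history = [list(f) for f in past]
    predicted = []
    for _ in range(horizon):
        prev, last = history[-2], history[-1]
        nxt = [2.0 * b - a for a, b in zip(prev, last)]
        history.append(nxt)      # feed the prediction back in for the next step
        predicted.append(nxt)
    return predicted


def render_frame(feature: Feature) -> Frame:
    """Stage 2 stand-in for the renderer: nearest-neighbour upsampling of the feature."""
    frame: Frame = []
    for value in feature:
        frame.extend([value, value])
    return frame


def predict_video(context: List[Frame], horizon: int) -> List[Frame]:
    """Representations before pixels: encode, forecast features, then render frames."""
    features = [frozen_encoder(f) for f in context]   # frozen feature space
    future = predict_features(features, horizon)      # stage 1: scene dynamics
    return [render_frame(f) for f in future]          # stage 2: appearance


context = [[0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0]]
print(predict_video(context, horizon=2))
# prints [[2.0, 2.0, 4.0, 4.0], [3.0, 3.0, 5.0, 5.0]]
```

Note that the renderer here only ever receives predicted features. In a trained system it would have been fitted on ground-truth features, which is where the train-test mismatch mentioned in the abstract comes from.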

Code & Models

Repositories

Sta8is/Re2Pix (GitHub)
