4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Chaoyang Wang; Ashkan Mirzaei; Vidit Goel; Willi Menapace; Aliaksandr Siarohin; Avalon Vinella; Michael Vasilkovsky; Ivan Skorokhodov; Vladislav Shakhrai; Sergey Korolev; Sergey Tulyakov; Peter Wonka

arXiv:2506.18839 · cs.CV · June 24, 2025

TL;DR

This paper introduces a feed-forward 4D scene generation framework that fuses spatial and temporal attention within a single layer and extends existing 3D reconstruction algorithms with a Gaussian head and a camera token replacement algorithm, reporting state-of-the-art 4D generation quality.

Contribution

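The paper contributes a fused 4D video architecture whose sparse attention pattern lets tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint, together with a 4D reconstruction model that extends existing 3D reconstruction algorithms with a Gaussian head and a camera token replacement algorithm.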

Findings

01 Achieved state-of-the-art 4D generation quality.

02 Improved 3D reconstruction accuracy.

03 Introduced an efficient sparse attention pattern for 4D data.

Abstract

We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and…
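
The sparse attention pattern described in the abstract can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function names, the flattened token ordering (view-major, then time, then tokens within a frame), and the fixed `tokens_per_frame` are all assumptions made for illustration. The only rule taken from the abstract is that a token attends to tokens in the same frame, at the same timestamp, or from the same viewpoint; since two tokens in the same frame share both view and time, the rule reduces to "same view or same time".

```python
from itertools import product


def fused_view_time_mask(num_views, num_times, tokens_per_frame):
    """Build a boolean attention mask over a views x times grid of frames.

    Hypothetical helper: mask[i][j] is True when token i may attend to
    token j, i.e. when the two tokens share a viewpoint or a timestamp.
    Tokens in the same frame share both, so they are covered as well.
    """
    # Assumed layout: flatten the grid view-major and record the
    # (view, time) coordinate of every token.
    coords = [
        (view, time)
        for view, time in product(range(num_views), range(num_times))
        for _ in range(tokens_per_frame)
    ]
    return [
        [vi == vj or ti == tj for (vj, tj) in coords]
        for (vi, ti) in coords
    ]


def mask_density(mask):
    """Fraction of token pairs that are allowed to attend to each other."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


if __name__ == "__main__":
    mask = fused_view_time_mask(num_views=4, num_times=4, tokens_per_frame=2)
    print(len(mask), mask_density(mask))
```

Under these assumptions each token sees the frames in its own row and column of the grid, that is `num_views + num_times - 1` frames out of `num_views * num_times`, so the mask becomes sparser relative to full 4D attention as the grid grows.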

Taxonomy

Methods: Diffusion