See4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Dongyue Lu; Ao Liang; Tianxin Huang; Xiao Fu; Yuyang Zhao; Baorui Ma; Liang Pan; Wei Yin; Lingdong Kong; Wei Tsang Ooi; Ziwei Liu

arXiv:2510.26796·cs.CV·March 13, 2026

TL;DR

See4D is a pose-free 4D generation method that renders input video to a bank of fixed virtual cameras and fills the missing regions with a video inpainting model, synthesizing spatiotemporal content without explicit 3D supervision and improving robustness and generalization.

Contribution

The paper proposes a novel pose-free, trajectory-to-camera framework with a view-conditional inpainting model and autoregressive inference, enabling 4D scene synthesis from casual videos without 3D annotations.
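To make the warp-then-inpaint idea behind this framework concrete, here is a minimal sketch of the warping half only. It is not the paper's implementation: the pinhole model, the per-pixel depth input, the purely horizontal camera offset, and the names `warp_to_virtual_camera` and `render_to_camera_bank` are all assumptions introduced for illustration. Each frame is forward-warped into a fixed virtual camera, and the pixels that receive no source value form the hole mask an inpainting model would be asked to fill.

```python
from typing import List, Optional, Tuple

Image = List[List[Optional[float]]]  # None marks a pixel with no source value


def warp_to_virtual_camera(
    frame: List[List[float]],
    depth: List[List[float]],
    focal: float,
    baseline: float,
) -> Tuple[Image, List[List[bool]]]:
    """Forward-warp one frame into a camera shifted sideways by `baseline`.

    Assumption: pinhole camera and pure horizontal translation, so each pixel
    moves left by disparity = focal * baseline / depth (rounded to a pixel).
    Returns the warped frame and a hole mask (True = missing, to be inpainted).
    """
    h, w = len(frame), len(frame[0])
    warped: Image = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]  # nearest depth seen so far
    for y in range(h):
        for x in range(w):
            z = depth[y][x]
            x_new = x - int(round(focal * baseline / z))
            # Keep the nearest surface when several pixels land on one target.
            if 0 <= x_new < w and z < zbuf[y][x_new]:
                zbuf[y][x_new] = z
                warped[y][x_new] = frame[y][x]
    mask = [[value is None for value in row] for row in warped]
    return warped, mask


def render_to_camera_bank(frames, depths, focal, baselines):
    """Warp every frame to every fixed virtual camera.

    result[c][t] is the (warped, mask) pair for camera c at time step t.
    The camera set is fixed up front, independent of the scene's motion.
    """
    return [
        [warp_to_virtual_camera(f, d, focal, b) for f, d in zip(frames, depths)]
        for b in baselines
    ]
```

In this toy setup, a larger `baseline` produces larger holes, which is the region a view-conditional inpainting model would need to synthesize; the inpainting step itself is not sketched here.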

Findings

01. Outperforms pose- and trajectory-conditioned baselines in benchmarks

02. Achieves superior generalization and scene coherence

03. Effectively synthesizes 4D content from in-the-wild videos

Abstract

Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce See4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting…
