ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

Mingyu Dong; Chong Xia; Mingyuan Jia; Weichen Lyu; Long Xu; Zheng Zhu; Yueqi Duan

arXiv:2604.10789·cs.CV·April 14, 2026

TL;DR

ReplicateAnyScene is a zero-shot framework that transforms casually captured videos into compositional 3D scenes by aligning textual, visual, and spatial information, a capability the authors describe as pivotal for Spatial Intelligence and Embodied AI.

Contribution

The paper introduces a fully automated, zero-shot pipeline whose five-stage cascade extracts and structurally aligns generic priors from vision foundation models to reconstruct compositional 3D scenes from video.

Findings

01. Outperforms existing methods in generating high-quality 3D scenes

02. Achieves semantic coherence and physical plausibility in reconstructions

03. Introduces the C3DR benchmark for comprehensive evaluation

Abstract

Humans exhibit an innate capacity to rapidly perceive and segment objects from video observations, and even mentally assemble them into structured 3D scenes. Replicating such capability, termed compositional 3D reconstruction, is pivotal for the advancement of Spatial Intelligence and Embodied AI. However, existing methods struggle to achieve practical deployment due to the insufficient integration of cross-modal information, leaving them dependent on manual object prompting, reliant on auxiliary visual inputs, and restricted to overly simplistic scenes by training biases. To address these limitations, we propose ReplicateAnyScene, a framework capable of fully automated and zero-shot transformation of casually captured videos into compositional 3D scenes. Specifically, our pipeline incorporates a five-stage cascade to extract and structurally align generic priors from vision foundation…
