4DNeX: Feed-Forward 4D Generative Modeling Made Easy
Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu

TL;DR
4DNeX introduces an efficient, end-to-end feed-forward framework for generating 4D dynamic scene representations from a single image, leveraging a large-scale dataset and a unified 6D video representation.
Contribution
The paper presents a novel unified 6D video representation and adaptation strategies for pretrained video diffusion models, enabling scalable 4D scene generation from a single image.
Findings
Outperforms existing methods in efficiency and generalizability
Creates high-quality dynamic point clouds from single images
Introduces 4DNeX-10M dataset with high-quality annotations
Abstract
We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches; 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry; and 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable…
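The unified 6D video representation described in the abstract can be pictured with a small sketch. This is not the authors' implementation: it assumes the XYZ sequence is stored as a per-pixel 3D coordinate map aligned with the RGB frames, and the function names `make_6d_video` and `to_dynamic_point_cloud` are hypothetical. It only illustrates how paired RGB and XYZ frames might be stacked into one six-channel video and read back as a per-frame colored point cloud.

```python
import numpy as np


def make_6d_video(rgb, xyz):
    """Stack an RGB video and a per-pixel XYZ video into one (T, H, W, 6) array.

    Assumption: both inputs have shape (T, H, W, 3) and are pixel-aligned.
    """
    if rgb.shape != xyz.shape or rgb.shape[-1] != 3:
        raise ValueError("rgb and xyz must both have shape (T, H, W, 3)")
    return np.concatenate([rgb, xyz], axis=-1)


def to_dynamic_point_cloud(video_6d):
    """Flatten each frame into an (H*W, 6) array with columns R, G, B, X, Y, Z."""
    t, h, w, c = video_6d.shape
    return [video_6d[i].reshape(h * w, c) for i in range(t)]


# Toy data standing in for a real clip (random values, illustration only).
rng = np.random.default_rng(0)
T, H, W = 4, 8, 16
rgb = rng.random((T, H, W, 3)).astype(np.float32)      # colors in [0, 1)
xyz = rng.normal(size=(T, H, W, 3)).astype(np.float32)  # fake 3D coordinates

video_6d = make_6d_video(rgb, xyz)
clouds = to_dynamic_point_cloud(video_6d)

print(video_6d.shape)                # (4, 8, 16, 6)
print(len(clouds), clouds[0].shape)  # 4 (128, 6)
```

Each entry of `clouds` is one time step of the dynamic point cloud, so appearance and geometry stay paired pixel by pixel. How the two modalities are actually fused inside the diffusion model is a separate design choice that this sketch does not cover.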
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper tackles an important and challenging problem: generating 4D point clouds from a single image with a pretrained video diffusion model.
- The construction of a dataset with 9.2M+ frames from diverse sources (DL3DV, RE10K, Pexels, Vimeo, VDM) addresses the scarcity of 4D training data.
- 15-minute generation vs. >1 hour for Free4D demonstrates practical advantages.
- The progression from motivation to method to results is logical and easy to follow.
- The dataset relies entirely on pseudo-annotations from MonST3R, MegaSaM, and DUSt3R, yet there is no quantitative validation of these annotations against DROID-SLAM [1], which is used in many recent works as ground truth for 3D dynamic scenes.
- Unclear 3D point-cloud (XYZ) quality; more XYZ visualizations and comparisons to other methods would help.
- Accuracy of the post-optimized camera parameters is not analyzed.
- User study details are missing (e.g., number of participants).
- It’s unclear
1) The method is conceptually simple and easy to follow.
2) The work targets a feed-forward model for image-to-4D generation, which is a practical and relevant research topic.
3) Experimental results show that the proposed method achieves state-of-the-art performance in its evaluated settings.
1) The paper does not mention Aether, which, to my knowledge, is the first work that leverages video diffusion models for joint RGB and geometry (camera ray maps and depth maps) prediction, and also supports image-conditioned 4D generation, albeit as a world model. The authors should discuss the differences, advantages, and disadvantages relative to Aether, and, if possible, include comparison results to strengthen the work.
2) For annotation, the authors rely solely on feed-forward models (Mon
1. The authors introduce a successful strategy for adapting existing, powerful video diffusion models to the 4D domain. The systematic investigation into fusion mechanisms, culminating in the adoption of width-wise fusion, is an important technical finding for jointly modeling appearance and geometry sequences.
2. 4DNeX is the first feed-forward method to tackle the challenging task of single-image-to-4D generation. This feed-forward nature makes it significantly more efficient than computational
1. The fundamental technical contribution is primarily an engineering adaptation of a pretrained video generation model to a new input/output domain. As the reviewer notes, the RGB-XYZ representation itself is not new (Zhang et al.). The novelty lies in the generation paradigm (image-to-4D, feed-forward), but the algorithmic advancements beyond the fine-tuning strategies are limited, which may lead to a low technical score.
2. The large-scale training is entirely dependent on pseudo-4D annotation
1. “RGB + XYZ as 6D video” is conceptually elegant: it unifies appearance and geometry without NeRF volume rendering or Gaussian splats during training.
2. Width-wise fusion is empirically validated and justified in terms of token interaction distance.
3. LoRA-only tuning on the 14B Wan2.1 model while preserving pretrained RGB distributions is a strong practical insight, compared to Cat4D / Free4D, which lose RGB appearance fidelity when optimizing all parameters.
4. Post-optimization for camera recovery
1. Core novelty is incremental, effectively “fine-tuning Wan2.1 to predict XYZ instead of RGB” with latent concatenation and normalization. There is no fundamentally new 4D generative architecture. This feels closer to a careful repurposing of a large pretrained model than to a new generative paradigm.
2. No true camera control or 3D consistency is learned during generation itself. The “feed-forward” claim is slightly misleading: XYZ is predicted in image-plane coordinates, not explicitly
Taxonomy
Topics: 3D Modeling in Geospatial Applications · Simulation Techniques and Applications
