UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images

Junhwa Hur; Charles Herrmann; Songyou Peng; Philipp Henzler; Zeyu Ma; Todd Zickler; Deqing Sun

arXiv:2602.24290·cs.CV·March 6, 2026

Open Access · 3 Reviews

TL;DR

UFO-4D is a novel feedforward framework that reconstructs dense 4D scenes from two unposed images by jointly estimating geometry, motion, and camera pose using differentiable rendering of dynamic Gaussian primitives.

Contribution

It introduces a unified, self-supervised feedforward method for 4D reconstruction from unposed images, outperforming prior approaches by leveraging differentiable rendering and shared primitives.

Findings

1. Outperforms prior methods by up to 3 times in accuracy.
2. Enables high-fidelity 4D interpolation across views and time.
3. Achieves joint estimation of geometry, motion, and camera pose in a single framework.

Abstract

Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework to reconstruct a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint and consistent estimation of 3D geometry, 3D motion, and camera pose in a feedforward manner. Our core insight is that differentiably rendering multiple signals from a single Dynamic 3D Gaussian representation offers major training advantages. This approach enables a self-supervised image synthesis loss while tightly coupling appearance, depth, and motion. Since all modalities share the same geometric primitives, supervising one inherently regularizes and improves the others. This synergy overcomes…
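
To make the shared-primitive idea in the abstract concrete, here is a minimal sketch, not the paper's rasterizer. It assumes the standard front-to-back alpha compositing used in Gaussian splatting and evaluates a single pixel. The function names (`composite_weights`, `render_pixel`) and the toy values are hypothetical. The point it illustrates is that color, depth, and 3D motion are all blended with the same per-Gaussian weights, so a loss on any one rendered signal also sends gradients to the opacities and ordering that the others depend on.

```python
import numpy as np

def composite_weights(alphas):
    """Front-to-back compositing weights for one pixel.

    alphas: (N,) per-Gaussian opacities at this pixel, sorted near to far.
    Weight i is alpha_i times the transmittance left by Gaussians 0..i-1.
    """
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance before each Gaussian: 1 for the first, then running product.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return alphas * transmittance

def render_pixel(alphas, colors, depths, motions):
    """Blend several per-Gaussian attributes with ONE shared set of weights.

    colors: (N, 3), depths: (N,), motions: (N, 3) hypothetical 3D motion vectors.
    Depth and motion are unnormalized weighted sums (an assumption of this sketch).
    """
    w = composite_weights(alphas)
    return {
        "weights": w,
        "color": w @ np.asarray(colors, dtype=float),
        "depth": float(w @ np.asarray(depths, dtype=float)),
        "motion": w @ np.asarray(motions, dtype=float),
    }

# Toy example: two Gaussians covering the same pixel.
out = render_pixel(
    alphas=[0.5, 0.5],
    colors=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    depths=[1.0, 2.0],
    motions=[[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]],
)
print(out["weights"], out["color"], out["depth"], out["motion"])
```

Because the weights are computed once and reused, supervising only the rendered depth in this sketch would still change how color and motion are composited, which is the coupling the abstract describes.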

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper focuses on a cutting-edge task: directly predicting Gaussian attributes from unposed images. The task is applicable to many downstream applications.
2. The paper implements a standard, simple network built from pure ViTs, which could potentially be scaled up to larger datasets and parameter counts.

Weaknesses

1. The results are not convincing. The paper does not show enough visualizations of the rendering quality of the reconstructed 3DGS at different novel views, and no videos are provided in the supplementary material to clearly visualize the reconstruction quality, so I am not sure whether the generated 3DGS are of high quality. Given the current figures, I do not see a significant performance gain compared to predicting only colored points, which weakens the insight of predicting Gaussian primitives. 2

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- **Unified feedforward architecture:** Unlike previous works that estimate geometry and motion in separate steps, the proposed UFO-4D offers a unified 4D representation capable of solving multiple downstream perception tasks jointly, which has been found to be beneficial in other domains.
- **Leveraging 4DGS representation:** Leveraging 4D Gaussian Splatting as the representation allows the model to be trained with additional photometric supervision, and also allow other est

Weaknesses

- **Lack of direct comparison:** Although I agree that a representation that tightly couples pointmap and motion estimation will further boost performance, it is difficult to isolate the benefits of adopting this new representation itself, as the proposed method also uses a very recently open-sourced dataset, Stereo4D, for training. It would be nice to include a comparison with a baseline method trained with the same dataset recipe to better highlight the advantages o

Reviewer 03 · Rating 8 · Confidence 4

Strengths

Although there are several papers tackling the general problem of 4D estimation (many correctly mentioned in the related work), this contribution stands out as a simple idea that is very well executed. Namely, the idea of rasterizing geometric and motion information into 2D to supervise it with scene flow and point maps (i.e. LIDAR) from real datasets seems somewhat novel (if a little obvious in hindsight, like all good ideas). This differs from approaches that demand 3D supervision, which often

Weaknesses

Since the method is based on NoPoSplat and MASt3R, a direct comparison with those methods, where appropriate, is needed. This would quantify the actual gains the method brings. However, it is understood that broadening the applicability of the original models can be a sufficient justification for the proposal. As a minor aside on clarity, the smoothness loss should be written out explicitly to make the paper self-contained, since it is an important factor for performance.

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis