Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow
Shenhan Qian, Ganlin Zhang, Shangzhe Wu, Daniel Cremers

TL;DR
Flow4R is a unified, scene-flow-based framework that reconstructs and tracks dynamic 3D scenes simultaneously. It outperforms existing methods by estimating geometry and motion in a single model.
Contribution
The paper presents Flow4R, an approach that uses camera-space scene flow as its core representation, enabling joint 4D reconstruction and tracking without explicit pose regressors or bundle adjustment.
Findings
Achieves state-of-the-art results on 4D reconstruction tasks.
Effectively handles static and dynamic scenes through joint training.
Operates with a single forward pass using a Vision Transformer.
Abstract
Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set (3D point position, scene flow, pose weight, and confidence) from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets,…
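To make the link between scene flow, camera motion, and object motion more concrete, here is a minimal sketch. It is not the paper's method: it assumes, for illustration only, that a rigid transform can be fit to the per-pixel points and their scene flow with a weighted Procrustes (Kabsch) solve, using the pose weights to downweight moving pixels. The function name `split_flow` and the array shapes are hypothetical.

```python
import numpy as np

def split_flow(points, flow, weights):
    """Illustrative only (an assumption, not Flow4R's actual procedure).

    Fits R, t minimizing sum_i w_i * ||R p_i + t - (p_i + f_i)||^2
    (weighted Kabsch), then returns the flow left over after removing
    that rigid motion.

    points:  (N, 3) 3D point positions in camera space
    flow:    (N, 3) camera-space scene flow per point
    weights: (N,)   non-negative pose weights (higher = more static)
    """
    p = np.asarray(points, dtype=float).reshape(-1, 3)
    f = np.asarray(flow, dtype=float).reshape(-1, 3)
    w = np.asarray(weights, dtype=float).reshape(-1)
    w = w / w.sum()                      # normalize weights to sum to 1

    q = p + f                            # where each point ends up
    p_mean = w @ p                       # weighted centroids
    q_mean = w @ q

    # Weighted cross-covariance between centered source and target points.
    H = (p - p_mean).T @ (w[:, None] * (q - q_mean))
    U, _, Vt = np.linalg.svd(H)

    # Reflection guard so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean

    rigid_flow = p @ R.T + t - p         # flow explained by the rigid fit
    residual = f - rigid_flow            # remaining, non-rigid motion
    return R, t, residual
```

Under this assumption, the fitted rigid transform accounts for the apparent motion of static pixels (how it maps to a camera pose depends on the chosen convention), and the residual is nonzero only where objects move independently.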
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Human Pose and Action Recognition
