SirenPose: Dynamic Scene Reconstruction via Geometric Supervision

Kaitong Cai; Jensen Zhang; Jing Yang; Keze Wang

arXiv:2512.20531·cs.CV·December 24, 2025

Open Access

TL;DR

SirenPose is a novel method for dynamic scene reconstruction from monocular videos that combines geometric supervision, physics-inspired constraints, and high-frequency signal modeling to achieve accurate, consistent, and detailed 3D reconstructions.

Contribution

It introduces a geometry-aware loss with sinusoidal networks, expands the UniKPT dataset, and employs graph neural networks for improved dynamic scene and pose estimation.
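To make the graph-neural-network part concrete, here is a minimal sketch of one generic graph-convolution step over a keypoint skeleton. It is not the authors' architecture: the normalization follows the standard GCN formulation, and the function names, the 3-keypoint chain, and the feature values are all hypothetical choices for illustration.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected keypoint graph."""
    # Start from the identity so every keypoint keeps its own features (self-loops).
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def graph_conv(features, adj, weight):
    """One propagation step, ReLU(adj @ features @ weight), written with plain lists."""
    n, f_in, f_out = len(features), len(weight), len(weight[0])
    # Aggregate each keypoint's features with those of its skeleton neighbours.
    agg = [[sum(adj[i][k] * features[k][f] for k in range(n)) for f in range(f_in)]
           for i in range(n)]
    # Apply the (here hand-set, normally learned) linear map, then ReLU.
    return [[max(0.0, sum(agg[i][f] * weight[f][g] for f in range(f_in)))
             for g in range(f_out)] for i in range(n)]

# Hypothetical 3-keypoint chain (e.g. shoulder, elbow, wrist) with 2D features.
edges = [(0, 1), (1, 2)]
adj = normalized_adjacency(3, edges)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
mixed = graph_conv(features, adj, identity)
```

After one step, each keypoint's feature vector is a weighted mix of its own features and those of its direct neighbours, which is the basic mechanism by which a graph network can encode structural correlations between keypoints.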

Findings

01. Outperforms state-of-the-art methods on the Sintel, Bonn, and DAVIS benchmarks.

02. Reduces FVD by 17.8% and FID by 28.7%, and improves LPIPS by 6%.

03. Enhances temporal consistency and geometric accuracy in dynamic scenes.

Abstract

We introduce SirenPose, a geometry-aware loss formulation that integrates the periodic activation properties of sinusoidal representation networks with keypoint-based geometric supervision, enabling accurate and temporally consistent reconstruction of dynamic 3D scenes from monocular videos. Existing approaches often struggle with motion fidelity and spatiotemporal coherence in challenging settings involving fast motion, multi-object interaction, occlusion, and rapid scene changes. SirenPose incorporates physics-inspired constraints to enforce coherent keypoint predictions across both spatial and temporal dimensions, while leveraging high-frequency signal modeling to capture fine-grained geometric details. We further expand the UniKPT dataset to 600,000 annotated instances and integrate graph neural networks to model keypoint relationships and structural correlations. Extensive…
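The abstract does not give the loss in closed form, so the following is only a rough sketch of the two ingredients it names: a sine-activated layer of the kind used in sinusoidal representation networks, and a keypoint loss with spatial and temporal terms. The specific terms (position error, bone-length consistency, and an acceleration penalty as a stand-in for a physics-inspired constraint), the weights `lam_bone` and `lam_temp`, and the frequency scale `omega_0 = 30.0` are assumptions for illustration, not the paper's definitions.

```python
import math

def siren_layer(x, weights, bias, omega_0=30.0):
    """Sine-activated layer: y_j = sin(omega_0 * (W[j] . x + b[j]))."""
    return [math.sin(omega_0 * (sum(w * xi for w, xi in zip(row, x)) + b))
            for row, b in zip(weights, bias)]

def _dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def keypoint_loss(pred, target, edges, lam_bone=1.0, lam_temp=1.0):
    """Hypothetical geometric loss over keypoint tracks.

    pred, target: list over frames of list over keypoints of (x, y, z).
    edges: pairs of keypoint indices that form the skeleton.
    """
    T, K = len(pred), len(pred[0])
    # Spatial term: mean squared position error per keypoint.
    pos = sum(_dist(pred[t][k], target[t][k]) ** 2
              for t in range(T) for k in range(K)) / (T * K)
    # Structural term: predicted bone lengths should match the reference ones.
    bone = 0.0
    if edges:
        bone = sum((_dist(pred[t][i], pred[t][j]) - _dist(target[t][i], target[t][j])) ** 2
                   for t in range(T) for i, j in edges) / (T * len(edges))
    # Temporal term: penalise acceleration (second finite difference) of each track.
    temp = 0.0
    if T >= 3:
        temp = sum(
            sum((p1 - 2.0 * p0 + pm) ** 2
                for pm, p0, p1 in zip(pred[t - 1][k], pred[t][k], pred[t + 1][k]))
            for t in range(1, T - 1) for k in range(K)) / ((T - 2) * K)
    return pos + lam_bone * bone + lam_temp * temp

# Two keypoints moving at constant velocity over three frames (made-up data).
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
          [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
          [(0.2, 0.0, 0.0), (1.2, 0.0, 0.0)]]
jittered = [target[0], [(0.1, 0.3, 0.0), (1.1, 0.0, 0.0)], target[2]]
loss_clean = keypoint_loss(target, target, edges=[(0, 1)])
loss_jitter = keypoint_loss(jittered, target, edges=[(0, 1)])
```

With these made-up tracks, a prediction that matches the smooth reference incurs essentially zero loss, while a single jittered frame is penalised by all three terms at once, which is the kind of joint spatial and temporal coherence the abstract describes.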

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Human Pose and Action Recognition