Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

Seong Hyeon Park; Jinwoo Shin

arXiv:2505.01737·cs.CV·July 29, 2025

TL;DR

This paper introduces MMP, a feed-forward model that estimates dynamic scene geometry from monocular videos by producing a pointmap representation that evolves over multiple frames, with 15.1% lower regression error than previous methods.

Contribution

The paper proposes a novel trajectory encoding module within a Siamese architecture for improved dynamic scene geometry estimation in a feed-forward manner.

Findings

1. Achieves 15.1% lower regression error than previous methods.
2. Produces state-of-the-art quality in multi-frame pointmap prediction.
3. Operates efficiently without heavy test-time optimization.

Abstract

Estimating the 3D geometry of dynamic scenes captured in monocular videos is a fundamental challenge in computer vision. Object motion makes the task especially difficult: existing models are limited to predicting only partial attributes of a dynamic scene, such as depth or pointmaps that span only a pair of frames. Because these attributes are inherently noisy across multiple frames, test-time global optimization is often employed to fully recover the geometry, which is liable to failure and incurs heavy inference costs. To address this challenge, we present a new model, coined MMP, that estimates the geometry in a feed-forward manner and produces a dynamic pointmap representation that evolves over multiple frames. Specifically, based on the recent Siamese architecture, we introduce a new trajectory encoding module to project…
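For readers unfamiliar with the term, the sketch below illustrates what a pointmap might look like as a data structure, assuming it means an image-aligned grid that stores one 3D point per pixel. It is not the paper's method: MMP predicts pointmaps with a learned network, whereas this example simply back-projects a known depth map through a pinhole camera. The function names (`depth_to_pointmap`, `dynamic_pointmap`), the intrinsics `fx`, `fy`, `cx`, `cy`, and the choice to express each frame in its own camera coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Pointmap = List[List[Point]]  # one 3D point per pixel, indexed [row][col]


def depth_to_pointmap(depth: List[List[float]],
                      fx: float, fy: float,
                      cx: float, cy: float) -> Pointmap:
    """Back-project a depth map into a per-pixel grid of 3D points.

    Assumes a pinhole camera; points are in that camera's own frame.
    """
    pointmap: Pointmap = []
    for v, row in enumerate(depth):        # v: pixel row
        out_row: List[Point] = []
        for u, z in enumerate(row):        # u: pixel column, z: depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        pointmap.append(out_row)
    return pointmap


def dynamic_pointmap(depth_frames: List[List[List[float]]],
                     fx: float, fy: float,
                     cx: float, cy: float) -> List[Pointmap]:
    """Hypothetical multi-frame container: one pointmap per video frame."""
    return [depth_to_pointmap(d, fx, fy, cx, cy) for d in depth_frames]


if __name__ == "__main__":
    # Two tiny 2x2 "frames" with made-up depth values.
    frames = [
        [[2.0, 2.0], [4.0, 4.0]],
        [[2.5, 2.5], [4.0, 4.0]],
    ]
    sequence = dynamic_pointmap(frames, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
    print(len(sequence), "frames;", "first point:", sequence[0][0][0])
```

A sequence of such grids, one per frame, gives a rough picture of a representation that changes over time; how MMP actually parameterizes and aligns its pointmaps across frames is described in the paper, not here.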

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Human Motion and Animation