TL;DR
GemDepth introduces a geometry-aware framework for 3D-consistent video depth estimation, leveraging explicit camera motion and geometric embeddings to improve detail and temporal coherence.
Contribution
It proposes a novel Geometry-Embedding Module and an Alternating Spatio-Temporal Transformer to enhance 3D consistency and spatial detail in video depth estimation.
Findings
Achieves state-of-the-art performance on multiple datasets.
Effectively maintains 3D geometric consistency under view changes.
Improves spatial detail and temporal coherence simultaneously.
Abstract
Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating…
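To make the role of inter-frame camera poses concrete, the following is a minimal sketch of the standard rigid-body relation between two frames. It is not GemDepth's GEM, which is a learned module whose details are not given in this summary. It only illustrates, under the assumption of known 4x4 camera-to-world poses, how a relative pose ties a 3D point seen in one frame to its coordinates (and depth) in another. All function names here are hypothetical.

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 camera-to-world pose from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_wc_i, T_wc_j):
    """Transform taking points from camera i coordinates to camera j coordinates.

    Assumes both inputs are camera-to-world poses (an assumption of this sketch).
    """
    return np.linalg.inv(T_wc_j) @ T_wc_i

def transfer_point(p_i, T_ji):
    """Map a 3D point from camera i coordinates into camera j coordinates."""
    p_h = np.append(np.asarray(p_i, dtype=float), 1.0)  # homogeneous coordinates
    return (T_ji @ p_h)[:3]

# Frame i sits at the world origin; frame j is shifted 1 unit along +x.
T_i = make_pose(np.eye(3), [0.0, 0.0, 0.0])
T_j = make_pose(np.eye(3), [1.0, 0.0, 0.0])

T_ji = relative_pose(T_i, T_j)
p_j = transfer_point([0.0, 0.0, 5.0], T_ji)  # point 5 units in front of camera i
depth_in_j = p_j[2]                          # z component is the depth in frame j
```

With a pure sideways translation the depth of the point is unchanged while its lateral position shifts, which is the kind of cross-frame constraint a 3D-consistent depth estimate would be expected to respect.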
