Pixel-Perfect Visual Geometry Estimation

Gangwei Xu; Haotong Lin; Hongcheng Luo; Haiyang Sun; Bing Wang; Guang Chen; Sida Peng; Hangjun Ye; Xin Yang

arXiv:2601.05246 · cs.CV · January 9, 2026

Open Access

TL;DR

This paper introduces pixel-perfect visual geometry models that leverage generative diffusion transformers to produce high-quality, flying-pixel-free point clouds for images and videos, significantly improving detail preservation and accuracy.

Contribution

The paper presents novel diffusion transformer-based models for monocular and video depth estimation, incorporating semantic prompting and cascade architectures for enhanced efficiency and detail.

Findings

1. Achieves state-of-the-art performance in monocular and video depth estimation.
2. Produces significantly cleaner and more accurate point clouds.
3. Introduces efficient methods for temporal coherence in video depth estimation.

Abstract

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively…
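The abstract names flying pixels without defining them. The term commonly refers to depth values that land between a foreground and a background surface at an object boundary, so that the back-projected points float in empty space. The following is a minimal NumPy sketch of that idea, not the paper's method or its evaluation metric: the function names `unproject` and `flying_pixel_mask`, the pinhole intrinsics `fx, fy, cx, cy`, and the threshold `thresh` are all assumptions made for illustration.

```python
import numpy as np


def unproject(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to an (H, W, 3) point map (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def _axis_slice(axis, s):
    """Build a 2D index that applies slice `s` along `axis` only."""
    idx = [slice(None), slice(None)]
    idx[axis] = s
    return tuple(idx)


def flying_pixel_mask(depth, thresh=0.5):
    """Flag pixels that sit on a depth ramp between two surfaces.

    Heuristic (an assumption, not a standard metric): a pixel is flagged when
    the depth step into it and the depth step out of it both exceed `thresh`
    and have the same sign, along either image axis. A clean step edge has
    only one large step per pixel, so it is not flagged.
    """
    mask = np.zeros(depth.shape, dtype=bool)
    for axis in (0, 1):
        d = np.diff(depth, axis=axis)                # forward differences
        before = d[_axis_slice(axis, slice(None, -1))]  # step into the pixel
        after = d[_axis_slice(axis, slice(1, None))]    # step out of the pixel
        ramp = (
            (np.abs(before) > thresh)
            & (np.abs(after) > thresh)
            & (np.sign(before) == np.sign(after))
        )
        # `ramp` describes interior pixels 1..N-2 along this axis.
        mask[_axis_slice(axis, slice(1, -1))] |= ramp
    return mask


# A blurred edge (1 -> 2 -> 3) versus a sharp edge (1 -> 3).
blurred = np.tile(np.array([1.0, 1.0, 2.0, 3.0, 3.0]), (3, 1))
sharp = np.tile(np.array([1.0, 1.0, 1.0, 3.0, 3.0]), (3, 1))
points = unproject(blurred, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
clean_points = points[~flying_pixel_mask(blurred)]  # drop flagged points
```

In this toy case only the middle column of `blurred` is flagged, while `sharp` produces no flagged pixels, which matches the intuition that a model predicting crisp depth edges yields a cleaner point cloud.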

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis