Any Resolution Any Geometry: From Multi-View To Multi-Patch
Wenqing Cui, Zhenyu Li, Mykola Lavreniuk, Jian Shi, Ramzi Idoughi, Xiangjun Tang, Peter Wonka

TL;DR
The paper introduces URGT, a multi-patch transformer that enhances high-resolution depth and normal estimation by processing image patches with cross-patch attention and a novel sampling strategy, achieving state-of-the-art results.
Contribution
It proposes a unified multi-patch transformer framework with cross-patch attention and GridMix sampling for improved high-resolution 3D geometry estimation.
Findings
Achieves state-of-the-art results on UnrealStereo4K, with significant improvements in the reported metrics.
Demonstrates strong zero-shot and cross-domain generalization.
Scales effectively to very high resolutions.
Abstract
Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To…
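The patch partitioning and cross-patch attention described in the abstract can be illustrated with a toy sketch. This is not the URGT architecture: the function names `split_into_patches` and `cross_patch_attention` are hypothetical, the attention is single-head with identity query/key/value projections (no learned weights), each patch is flattened into one token, and the coarse depth and normal priors are omitted. It only shows the general idea that tokens from all patches attend to one another in a single pass.

```python
import math


def split_into_patches(image, patch_h, patch_w):
    """Split a 2D list (H x W) into non-overlapping patches, row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patches.append(
                [row[left:left + patch_w] for row in image[top:top + patch_h]]
            )
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_patch_attention(tokens):
    """Single-head self-attention over the tokens of ALL patches.

    Assumption for illustration: identity Q/K/V projections, so each output
    token is a softmax-weighted average of every input token.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


# Toy 4x4 "image" with values in [0, 1], split into four 2x2 patches.
image = [[(r * 4 + c) / 15.0 for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2, 2)

# Flatten each patch into one token, then let every patch attend to every other.
tokens = [[v for row in patch for v in row] for patch in patches]
mixed = cross_patch_attention(tokens)
```

Because every query attends over the keys of all patches, information from one patch can influence the output of any other, which is the property the abstract attributes to cross-patch attention. A practical model would use many tokens per patch, learned projections, and multiple heads.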
Taxonomy
Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization
