Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun, Hao Chen, and Chunhua Shen

TL;DR
This paper systematically investigates the critical factors affecting 3D visual geometry estimation, revealing that data quality, supervision strategies, and high-resolution inputs significantly influence model performance.
Contribution
The authors introduce CARVE, a resolution-enhanced model that integrates a new consistency loss and architectural improvements for better 3D geometry estimation.
Findings
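Scaling up data diversity and quality improves performance, even for state-of-the-art methods.
Commonly adopted confidence-aware and gradient-based losses may unintentionally hinder performance.
Joint supervision through per-sequence and per-frame alignment improves results, while local region alignment degrades performance.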
Abstract
Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the…
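To make the third insight concrete, the sketch below contrasts per-sequence and per-frame alignment for a depth-style loss. It is an illustration under stated assumptions, not the paper's implementation: the closed-form least-squares scale, the L1 error, the weights `w_seq` and `w_frame`, and all function names are hypothetical choices made here for clarity. Each frame is represented as a flat list of positive depth values.

```python
from typing import Sequence

Frame = Sequence[float]  # one depth map, flattened (assumed representation)


def lstsq_scale(pred: Frame, gt: Frame) -> float:
    """Closed-form scale s that minimizes sum((s * pred - gt) ** 2)."""
    num = sum(p * g for p, g in zip(pred, gt))
    den = sum(p * p for p in pred)
    return num / den


def aligned_l1(pred: Frame, gt: Frame, scale: float) -> float:
    """Mean absolute error after applying a single scale to the prediction."""
    return sum(abs(scale * p - g) for p, g in zip(pred, gt)) / len(gt)


def per_sequence_loss(preds: Sequence[Frame], gts: Sequence[Frame]) -> float:
    """One scale shared by every frame: penalizes cross-frame inconsistency."""
    flat_pred = [p for frame in preds for p in frame]
    flat_gt = [g for frame in gts for g in frame]
    return aligned_l1(flat_pred, flat_gt, lstsq_scale(flat_pred, flat_gt))


def per_frame_loss(preds: Sequence[Frame], gts: Sequence[Frame]) -> float:
    """An independent scale per frame: measures single-frame accuracy only."""
    losses = [aligned_l1(p, g, lstsq_scale(p, g)) for p, g in zip(preds, gts)]
    return sum(losses) / len(losses)


def joint_loss(preds, gts, w_seq: float = 1.0, w_frame: float = 1.0) -> float:
    """Weighted sum of both terms (the weights are hypothetical)."""
    return w_seq * per_sequence_loss(preds, gts) + w_frame * per_frame_loss(preds, gts)


# Two frames whose predictions are each correct up to a DIFFERENT scale.
gts = [[1.0, 2.0, 4.0], [2.0, 3.0, 5.0]]
preds = [[0.5 * g for g in gts[0]], [2.0 * g for g in gts[1]]]

print("per-frame   :", per_frame_loss(preds, gts))
print("per-sequence:", per_sequence_loss(preds, gts))
print("joint       :", joint_loss(preds, gts))
```

In this toy case each frame is accurate on its own, so the per-frame term is near zero, while the per-sequence term stays positive because no single scale fits both frames. Combining the two terms supervises single-frame accuracy and cross-frame consistency at once, which is one plausible reading of why joint supervision helps.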
