LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
Zhengqin Li, Cheng Zhang, Jakob Engel, Zhao Dong

TL;DR
LSRM (Large Sparse Reconstruction Model) is a scalable transformer-based model that improves high-fidelity 3D object reconstruction and inverse rendering by expanding the context window and using efficient sparse attention.
Contribution
Introduces a scalable architecture that combines sparse attention, a coarse-to-fine pipeline, and geometry-aware routing to improve the quality of 3D reconstruction and inverse rendering.
Findings
Handles 20x more object tokens than prior methods.
Achieves 2.5 dB higher PSNR in novel-view synthesis.
Reduces LPIPS by 40% relative to prior state-of-the-art methods.
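To put the PSNR figure in perspective, the short sketch below uses the standard definition PSNR = 10 * log10(MAX^2 / MSE) to convert a dB gain into a change in mean squared error. The function names are illustrative, and the 0.01 baseline MSE is an arbitrary example value rather than a number from the paper.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error and peak signal value."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# A 2.5 dB gain corresponds to multiplying the MSE by 10^(-0.25).
ratio = mse_ratio_for_gain(2.5)

# Arbitrary example baseline (not from the paper): MSE of 0.01 on [0, 1] images.
baseline_mse = 0.01
gain = psnr(baseline_mse * ratio) - psnr(baseline_mse)
print(f"MSE ratio: {ratio:.4f}, recovered gain: {gain:.2f} dB")
```

Under this definition, a 2.5 dB improvement means the mean squared error drops to roughly 56% of its previous value, whatever the baseline.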
Abstract
We introduce the Large Sparse Reconstruction Model to study how scaling transformer context windows impacts feed-forward 3D reconstruction. Although recent object-centric feed-forward methods deliver robust, high-quality reconstruction, they still lag behind dense-view optimization in recovering fine-grained texture and appearance. We show that expanding the context window -- by substantially increasing the number of active object and image tokens -- remarkably narrows this gap and enables high-fidelity 3D object reconstruction and inverse rendering. To scale effectively, we adapt native sparse attention in our architecture design, unlocking its capacity for 3D reconstruction with three key contributions: (1) an efficient coarse-to-fine pipeline that focuses computation on informative regions by predicting sparse high-resolution residuals; (2) a 3D-aware spatial routing mechanism that…
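As a rough illustration of why sparse attention lets the context window grow, the NumPy sketch below has each query attend only to a few key blocks chosen by a cheap coarse score. This is a generic top-k block-sparse attention toy, not the LSRM architecture or its routing mechanism. The function name, the mean-pooled block summaries, and the `block_size` and `top_k` parameters are all assumptions made for the example.

```python
import numpy as np


def topk_block_sparse_attention(q, k, v, block_size=4, top_k=2):
    """Toy block-sparse attention (hypothetical helper, not from the paper).

    q: (Nq, d) queries; k, v: (Nk, d) keys and values.
    Nk must be divisible by block_size.
    Returns the attention output (Nq, d) and the chosen block ids (Nq, top_k).
    """
    n_k, d = k.shape
    n_blocks = n_k // block_size
    k_blocks = k.reshape(n_blocks, block_size, d)
    v_blocks = v.reshape(n_blocks, block_size, d)

    # Coarse stage: summarize each key block by its mean and score it per query.
    k_coarse = k_blocks.mean(axis=1)                       # (n_blocks, d)
    block_scores = q @ k_coarse.T                          # (Nq, n_blocks)
    chosen = np.argsort(-block_scores, axis=1)[:, :top_k]  # (Nq, top_k)

    # Fine stage: gather only the selected blocks and run softmax attention on them.
    k_sel = k_blocks[chosen].reshape(q.shape[0], top_k * block_size, d)
    v_sel = v_blocks[chosen].reshape(q.shape[0], top_k * block_size, d)
    logits = np.einsum("qd,qnd->qn", q, k_sel) / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("qn,qnd->qd", weights, v_sel), chosen


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
out, chosen = topk_block_sparse_attention(q, k, v, block_size=4, top_k=2)
print(out.shape, chosen.shape)
```

In this toy, each query compares against `top_k * block_size` keys instead of all `Nk`, so the fine-stage cost scales with the number of selected tokens rather than the full context. Setting `top_k` to the total number of blocks recovers ordinary dense attention.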
