TL;DR
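Mix3R is a generative 3D reconstruction framework that combines feed-forward reconstruction with 3D generation in a single model, producing 3D shapes that stay aligned with the input views along with accurate camera poses. It builds on pretrained models from both lines of work so that each benefits the other.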
Contribution
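It introduces a Mixture-of-Transformers architecture whose first stage jointly generates sparse voxels together with per-view point maps and camera parameters aligned to them; a second stage then generates textured geometry. Generating these outputs jointly improves both 3D shape alignment and pose accuracy.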
Findings
Produces better input-aligned 3D shapes than pure generative methods.
Achieves more accurate camera pose estimates than previous feed-forward methods.
Effectively integrates pretrained priors for improved 3D reconstruction.
Abstract
Recent trends in sparse-view 3D reconstruction have taken two different paths: feed-forward reconstruction that predicts pixel-aligned point maps without a complete geometry, and generative 3D reconstruction that generates complete geometry but often with poor input-alignment. We present Mix3R, a novel generative 3D reconstruction method which mixes feed-forward reconstruction and 3D generation into a single framework in an aligned manner. Mix3R generates a 3D shape in two stages: a sparse voxel generation stage and a textured geometry generation stage. Unlike pure generative methods, our first-stage generation jointly produces a coarse 3D structure (sparse voxels), per-view point maps and camera parameters aligned to that 3D structure. This is made possible by introducing a Mixture-of-Transformers architecture that inserts global self-attentions to a feed-forward reconstruction model…
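The abstract is cut off before it describes the architecture in detail, so the following is only a rough illustration of the general idea of global self-attention shared between two token streams that keep separate weights. It is not the authors' implementation: the function names, the two-branch layout, the single attention head, and the token shapes are all assumptions made for this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_branch(rng, dim):
    """Hypothetical per-branch weights: each stream keeps its own projections."""
    return {name: rng.standard_normal((dim, dim)) / np.sqrt(dim)
            for name in ("q", "k", "v", "o")}


def global_self_attention(voxel_tokens, view_tokens, voxel_branch, view_branch):
    """One single-head attention step over the concatenated token sequence.

    Each stream is projected with its own weights (assumed layout), but
    attention runs over all tokens, so the streams can exchange information.
    """
    n_vox, dim = voxel_tokens.shape

    # Branch-specific projections, concatenated into one global sequence.
    q = np.concatenate([voxel_tokens @ voxel_branch["q"],
                        view_tokens @ view_branch["q"]])
    k = np.concatenate([voxel_tokens @ voxel_branch["k"],
                        view_tokens @ view_branch["k"]])
    v = np.concatenate([voxel_tokens @ voxel_branch["v"],
                        view_tokens @ view_branch["v"]])

    # Every token attends to every token from both streams.
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)
    mixed = attn @ v

    # Split back and apply branch-specific output projections with a residual.
    voxel_out = voxel_tokens + mixed[:n_vox] @ voxel_branch["o"]
    view_out = view_tokens + mixed[n_vox:] @ view_branch["o"]
    return voxel_out, view_out, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    voxel_tokens = rng.standard_normal((5, dim))  # stand-in for sparse-voxel tokens
    view_tokens = rng.standard_normal((7, dim))   # stand-in for per-view image tokens
    voxel_out, view_out, attn = global_self_attention(
        voxel_tokens, view_tokens, make_branch(rng, dim), make_branch(rng, dim))
    print(voxel_out.shape, view_out.shape, attn.shape)
```

In this sketch the only coupling between the two streams is the shared attention matrix, which is what lets a change in the view tokens influence the voxel tokens and vice versa.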
