MVGamba: Unify 3D Content Generation as State Space Sequence Modeling
Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang

TL;DR
MVGamba is a lightweight, unified 3D content generation model that treats Gaussian reconstruction as state space sequence modeling, improving multi-view consistency and texture detail at lower computational cost.
Contribution
The paper proposes a Gaussian reconstruction model built on an RNN-like State Space Model (SSM) that improves multi-view information propagation and can be combined with multi-view diffusion models.
Findings
Outperforms state-of-the-art methods on 3D generation tasks
Generates high-quality 3D content at 0.1x the model size
Improves multi-view consistency and detail in reconstructions
Abstract
Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to the compromise of multi-view information propagation in favor of adopting powerful yet computationally intensive architectures (e.g., Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for…
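To make the phrase "RNN-like State Space Model" concrete, here is a minimal sketch of a generic diagonal linear state space recurrence run over a sequence formed by concatenating tokens from several views. It is not the MVGamba architecture: the function name `ssm_scan`, the fixed parameters `a`, `b`, `c`, and the toy scalar "views" are all assumptions made for illustration. It only shows how a causal scan carries information from earlier tokens, including tokens from earlier views, forward into later outputs.

```python
from typing import List, Sequence


def ssm_scan(
    tokens: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal linear state space recurrence over a token sequence.

    For each token x_t (hypothetical scalar tokens):
        h_t = a * h_{t-1} + b * x_t   (elementwise over state channels)
        y_t = sum(c * h_t)
    The hidden state h_t is the only path by which earlier tokens
    influence later outputs, so the scan is causal by construction.
    """
    state = [0.0] * len(a)
    outputs: List[float] = []
    for x in tokens:
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


# Hypothetical toy input: two "views", each flattened to two scalar tokens.
view_1 = [1.0, 0.0]
view_2 = [0.0, 0.0]
sequence = view_1 + view_2  # views concatenated into one causal sequence

# Illustrative fixed parameters (assumed, not taken from the paper).
a = [0.5, 0.25]  # per-channel state decay
b = [1.0, 1.0]   # input projection
c = [1.0, 1.0]   # output projection

outputs = ssm_scan(sequence, a, b, c)
print(outputs)
```

In this toy run the only non-zero input is the first token of `view_1`, yet the outputs at the `view_2` positions are still non-zero: the state has carried context across the view boundary. The cost grows linearly with sequence length, which is the usual motivation for preferring such recurrences over quadratic attention.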
Taxonomy
Topics: Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques · Human Motion and Animation
Methods: Diffusion
