studentSplat: Your Student Model Learns Single-view 3D Gaussian Splatting
Yimu Pan, Hongda Mao, Qingshuang Chen, Yelin Kim

TL;DR
studentSplat introduces a novel single-view 3D scene reconstruction method using Gaussian splatting, leveraging a teacher-student architecture and extrapolation network to overcome inherent ambiguities and achieve state-of-the-art results.
Contribution
It presents a new single-view 3D Gaussian splatting approach with a teacher-student framework and extrapolation network, addressing scale ambiguity and enabling high-quality scene extrapolation.
Findings
Achieves state-of-the-art single-view reconstruction quality.
Performs comparably to multi-view methods at scene level.
Demonstrates competitive self-supervised depth estimation.
Abstract
Recent advances in feed-forward 3D Gaussian splatting have enabled remarkable multi-view 3D scene reconstruction or single-view 3D object reconstruction, but single-view 3D scene reconstruction remains under-explored due to the inherent ambiguity of a single view. We present studentSplat, a single-view 3D Gaussian splatting method for scene reconstruction. To overcome the scale ambiguity and extrapolation problems inherent in novel-view supervision from a single input, we introduce two techniques: 1) a teacher-student architecture where a multi-view teacher model provides geometric supervision to the single-view student during training, addressing scale ambiguity and encouraging geometric validity; and 2) an extrapolation network that completes missing scene context, enabling high-quality extrapolation. Extensive experiments show studentSplat achieves state-of-the-art single-view…
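As a rough illustration of the teacher-student idea in the abstract, the sketch below shows one way a multi-view teacher's depth could supervise a single-view student. The abstract does not specify the loss, so the L1 depth term, the function names `geometric_distillation_loss` and `combined_loss`, and the `weight` parameter are all assumptions made for illustration, not the authors' implementation. Depth maps are plain nested lists to keep the example dependency-free.

```python
from typing import Optional, Sequence

DepthMap = Sequence[Sequence[float]]
Mask = Sequence[Sequence[bool]]


def geometric_distillation_loss(
    student_depth: DepthMap,
    teacher_depth: DepthMap,
    valid_mask: Optional[Mask] = None,
) -> float:
    """Mean absolute depth difference over valid pixels.

    Hypothetical loss form: the source only says the teacher provides
    "geometric supervision"; L1 on depth is an assumption.
    """
    if len(student_depth) != len(teacher_depth):
        raise ValueError("depth maps must have the same number of rows")
    total, count = 0.0, 0
    for r, (s_row, t_row) in enumerate(zip(student_depth, teacher_depth)):
        if len(s_row) != len(t_row):
            raise ValueError("depth maps must have the same row lengths")
        for c, (s, t) in enumerate(zip(s_row, t_row)):
            # Skip pixels the (hypothetical) mask marks as unreliable.
            if valid_mask is not None and not valid_mask[r][c]:
                continue
            total += abs(s - t)
            count += 1
    return total / count if count else 0.0


def combined_loss(
    photo_loss: float,
    student_depth: DepthMap,
    teacher_depth: DepthMap,
    weight: float = 1.0,
) -> float:
    """Photometric term plus weighted teacher supervision (assumed weighting)."""
    return photo_loss + weight * geometric_distillation_loss(
        student_depth, teacher_depth
    )


# Toy example: the student predicts the right shape at twice the teacher's scale.
teacher = [[2.0, 4.0], [1.0, 3.0]]
student = [[4.0, 8.0], [2.0, 6.0]]
scale_penalty = geometric_distillation_loss(student, teacher)  # 2.5
```

In this toy case the student's depth has the correct relative structure but the wrong global scale, and the penalty is non-zero. That is the sense in which a teacher with a fixed scale could pin down the student's scale during training.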
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The overall paper is well-written, with the architectural designs described in detail, making it easy for readers to understand the training procedure and the contributions of the work.
- Although the task is challenging, the proposed method shows strong performance, achieving state-of-the-art results on multiple datasets.
- The proposed method is efficient in terms of model parameters and the number of Gaussians compared to previous methods.
- Mitigating the use of camera poses: The authors mention that the camera pose of a single image can be defined as the identity matrix, mitigating the use of camera poses of multi-view images. However, during the teacher-student geometric supervision, as MVSplat [1] has been trained on both RealEstate10K and ACID using SfM camera poses, this supervision guides studentSplat to learn these SfM camera-pose scales, which enables the photometric loss of $L_{photo}$ with a specific relative camera pose
The idea of distilling knowledge from a multi-view model to a single-view model is simple and makes sense. The paper is clearly written. studentSplat outperforms recent methods in single-image rendering and monocular depth. The ablation study in the main paper and supplementary is thorough.
I felt the evaluation is not very convincing. In the single-image setting, it is not surprising that pixelSplat and MVSplat perform worse, since they rely on feature matching across different views, which is unavailable with a single image (or, let's say, two identical images with baseline=0). For the evaluation with a single image, my concerns mainly come from Fig. 3, 4, 7, 8. In these images, we can find that the viewpoints of target views and input context view are similar (i.e. the baseline between input vi
* The approach of leveraging geometric priors from multi-view reconstruction methods to enhance single-view reconstruction is intriguing, and the authors have experimentally demonstrated significant improvements in the perspective extrapolator through multi-view distillation.
* The authors present a simple yet effective method for extrapolating when computing the novel view reconstruction loss.
* The paper is well-written and easy to follow.
* **3D Consistency**: In the unseen regions when extrapolating new views, the rendered results depend on the 2D generative model MI-GAN. Therefore, I am skeptical about the model's ability to generate extrapolated continuous new views with 3D consistency. I recommend that the authors supplement the discussion with relevant visual results or theoretical analyses.
* **Overclaim**: The authors state in the contributions section that they "propose the first single-view 3D scene Gaussian splatting mo
1. Clear writing
2. Experiments are conducted with large-scale datasets, and compared with existing SOTAs.
1. On page 1, line 38, why is it the first single-view 3D Gaussian splatting method? I recall that Flash3D [1], a paper by the first author of Splatter Image, has been publicly available since June 2024, which is hard to miss. I understand that the paper seems not to have been published yet at any conference or journal, but this does not mean that the authors can simply ignore this existing paper and claim their own paper as the first approach. I recommend the authors to cite this
Taxonomy
Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Human Pose and Action Recognition
