GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors
Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, Xiaodong Cun

TL;DR
GSFixer is a novel framework that uses reference-guided video diffusion priors to improve 3D Gaussian Splatting reconstructions from sparse views, enhancing artifact removal and 3D consistency.
Contribution
We introduce GSFixer, a reference-guided video diffusion approach that leverages semantic and geometric features for improved 3D scene reconstruction from sparse views.
Findings
GSFixer outperforms state-of-the-art methods in artifact restoration.
Our method enhances semantic coherence and 3D consistency.
The new DL3DV-Res benchmark facilitates evaluation of 3DGS artifact restoration.
Abstract
Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an ill-posed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic features and 3D geometric features of reference views…
Peer Reviews
Decision·Submitted to ICLR 2026
- **Task framing + benchmark.** Explicitly casting “artifact restoration” for sparse-view 3DGS and introducing **DL3DV-Res** gives a concrete testbed; the task is well motivated, and the dataset construction is described. Reported scores show clear, if not dramatic, gains over prior generative baselines on this benchmark.
- **Dual conditioning (2D + 3D tokens).** Conditioning the video diffusion on DINOv2 (semantics) and VGGT (geometry) is a coherent way to push consistency to the fixed frames, a
**Incremental relative to recent generative NVS systems** - The method largely leverages known techniques, including video diffusion restoration, iterative distillation back to the 3D representation, and simple camera path sampling, and is very close in spirit to GenFusion/Difix3D+. The paper’s own related-work section positions it as a variation rather than a conceptual step change. The only new contribution is the type of conditioning signal used in the fix step, making this an incremental
I think the most original contribution is the reference-guided video restoration model, which conditions a video diffusion model on both 2D semantic features (via DINOv2) and 3D geometric features (via VGGT) extracted from the input sparse views. Different from previous works, the condition is multimodal. The introduction of the RGT strategy is a clever way to refine the 3DGS systematically.
The scales, dimensions, and information densities of the geometric features (VGGT) and semantic features (DINOv2) may differ, and improper fusion may weaken one of them. How do the authors confirm that the fusion preserves both features? I did not see an ablation study on this part.
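The scale concern raised above can be made concrete with a small sketch. It is not the paper's fusion scheme, which this page does not describe. It assumes both token sets have already been projected to a common width, uses random numbers as stand-ins for DINOv2 and VGGT features, and the helpers `rms`, `rescale`, and `fuse` are hypothetical names introduced only for illustration. It shows one simple diagnostic: compare the RMS magnitude of the two modalities before and after per-modality rescaling.

```python
import math
import random


def rms(tokens):
    """Root-mean-square magnitude over all entries of a list of token vectors."""
    total = sum(x * x for tok in tokens for x in tok)
    count = sum(len(tok) for tok in tokens)
    return math.sqrt(total / count)


def rescale(tokens, target_rms=1.0, eps=1e-8):
    """Scale a whole token set so its overall RMS is (approximately) target_rms."""
    scale = target_rms / (rms(tokens) + eps)
    return [[x * scale for x in tok] for tok in tokens]


def fuse(semantic_tokens, geometric_tokens):
    """Rescale each modality separately, then concatenate along the token axis."""
    return rescale(semantic_tokens) + rescale(geometric_tokens)


random.seed(0)
# Stand-ins for real features (assumption): same width, very different scales.
semantic = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
geometric = [[random.gauss(0.0, 50.0) for _ in range(8)] for _ in range(6)]

# Before rescaling, the geometric tokens are far larger in magnitude.
naive_ratio = rms(geometric) / rms(semantic)

# After rescaling, both modalities enter the fused sequence at the same magnitude.
fused = fuse(semantic, geometric)
balanced_ratio = rms(fused[4:]) / rms(fused[:4])

print(f"geometric/semantic RMS before: {naive_ratio:.1f}, after: {balanced_ratio:.3f}")
```

Matching magnitudes is only a necessary check, not evidence that the model uses both signals; an ablation that drops each modality in turn, as the reviewer requests, would be needed for that.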
1. **Originality-wise**: The paper proposes a novel framework for robust 3D reconstruction.
2. **Quality-wise**: The proposed method combines DINOv2 and the VGGT head to extract 2D/3D features, which are used to fix 3D scenes reconstructed with Gaussian splatting.
3. **Clarity-wise**: The manuscript is clearly written, with well-structured methodology, detailed explanations, and intuitive visualizations that enhance understanding.
1. **Limited generalization of camera trajectories.** The reference-guided trajectory strategy performs well on circumferential trajectories, but its adaptability and effectiveness on non-circumferential or sparse, discontinuous trajectories (such as extreme angles and non-closed-loop aerial photography) have not been fully verified. The model may be sensitive to the distribution of reference viewpoints and may not generalize to diverse trajectories. More experiments and analysis in complex
