Sharp Monocular View Synthesis in Less Than a Second
Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen, Amaël Delaunoy, Tian Fang, Yanghai Tsin, Stephan R. Richter, Vladlen Koltun

TL;DR
SHARP is a fast, photorealistic view synthesis method from a single image that produces a 3D Gaussian scene representation in under a second, enabling real-time rendering with high accuracy.
Contribution
It introduces a novel, efficient 3D Gaussian scene representation regressed from a single image, enabling rapid, photorealistic view synthesis with metric scale and zero-shot generalization.
Findings
Reduces LPIPS by 25-34% versus the best prior model
Lowers DISTS by 21-43% versus previous state-of-the-art
Produces the 3D Gaussian representation in less than a second on a standard GPU; the representation then renders in real time
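The percentage figures above are relative reductions on lower-is-better perceptual metrics. As a minimal sketch of that arithmetic, the helper below computes how much lower one score is than a baseline. The function name `relative_reduction` and the example scores are hypothetical, chosen only for illustration; they are not values from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Return the percentage by which `ours` is lower than `baseline`.

    Intended for lower-is-better metrics such as LPIPS or DISTS.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - ours) / baseline


if __name__ == "__main__":
    # Hypothetical scores for illustration only; not taken from the paper.
    baseline_score = 0.40
    our_score = 0.30
    print(f"{relative_reduction(baseline_score, our_score):.1f}% lower")
```

A negative return value would mean the second score is worse than the baseline, which is why the sign is left intact rather than clamped.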
Abstract
We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at…
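To make "metric, with absolute scale" concrete, the sketch below shows how a point at a known depth in meters moves in the image when the camera is translated by a known distance in meters. It assumes a simple pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a pure translation with no rotation. The functions `unproject` and `project` are hypothetical helpers for illustration; they are not SHARP's network, its renderer, or its API.

```python
from typing import Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def unproject(u: float, v: float, depth_m: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with metric depth (meters) to a 3D point in the camera frame."""
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return (x, y, depth_m)


def project(point: Point3, fx: float, fy: float, cx: float, cy: float,
            cam_t: Point3 = (0.0, 0.0, 0.0)) -> Pixel:
    """Project a 3D point into a camera translated by cam_t (meters, no rotation)."""
    # Express the point in the translated camera's frame.
    x = point[0] - cam_t[0]
    y = point[1] - cam_t[1]
    z = point[2] - cam_t[2]
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


if __name__ == "__main__":
    # Assumed intrinsics and depth, for illustration only.
    fx = fy = 500.0
    cx, cy = 320.0, 240.0
    p = unproject(320.0, 240.0, 2.0, fx, fy, cx, cy)
    # Move the camera 10 cm to the right and see where the point lands.
    print(project(p, fx, fy, cx, cy, cam_t=(0.1, 0.0, 0.0)))
```

Because depth is in meters, the 10 cm camera shift has a definite effect on the pixel position; with a scale-ambiguous representation the same shift would have no fixed meaning.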
Peer Reviews
Decision·ICLR 2026 Poster
1. The task is promising for VR/AR applications, and the paper addresses a cutting-edge field. 2. The results are convincing and outperform many video-based methods that require costly inference.
1. The authors should provide a video in the supplementary material for a clearer comparison with SOTA methods. Since the method outputs a 3DGS, it would be more convincing to attach a video showing novel view synthesis results of the 3DGS. 2. How large a view offset can the model handle? For regions that are not visible in the input image, does the model have the capability to generatively infer the occlusions and scene extensions? 3. More recent works like See3D should be compared. My most concern l
1. The proposed method is fast and efficient, while achieving high-quality results. 2. The experimental results demonstrate strong performance across multiple datasets and metrics. 3. The writing is clear and the engineering contributions are solid.
1. The work reads more like a systems engineering paper than a novel research contribution. The scientific novelty is limited. The authors should better highlight the key innovations. 2. It would be better if the authors provided video results to showcase the real-time rendering capabilities. 3. The font used in the paper seems to be non-standard.
Novel combination of monocular depth inference and 3D Gaussian Splatting with impressive speed and fidelity. Clear architecture and training pipeline; loss design and curriculum are well justified. Extensive comparisons across datasets and perceptual metrics.
- Novel-view range unclear. The paper does not specify how far target views are from the input. Report actual displacement (e.g., angle, translation) and analyze performance versus view distance. - View-to-view consistency. Since only one novel view for each scene is reported, temporal stability across continuous camera motion is unknown. Evaluating flickering with continuous multiple novel view renderings for frame-to-frame consistency is desired. - Multi-view generalization. Can SHARP
Taxonomy
Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
