Sharp Monocular View Synthesis in Less Than a Second

Lars Mescheder; Wei Dong; Shiwei Li; Xuyang Bai; Marcel Santos; Peiyun Hu; Bruno Lecouat; Mingmin Zhen; Amaël Delaunoy; Tian Fang; Yanghai Tsin; Stephan R. Richter; Vladlen Koltun

arXiv:2512.10685·cs.CV·March 2, 2026

TL;DR

SHARP is a fast method for photorealistic view synthesis from a single image: it produces a 3D Gaussian scene representation in under a second, which can then be rendered in real time for nearby views.

Contribution

SHARP regresses a metric 3D Gaussian scene representation from a single image in one feedforward pass, enabling rapid, photorealistic view synthesis with absolute scale and zero-shot generalization across datasets.

Findings

01

Reduces LPIPS by 25-34% compared to the best prior model

02

Lowers DISTS by 21-43% compared to the best prior model

03

Regresses the 3D representation in less than a second on a standard GPU; the result can then be rendered in real time
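
The percentages above are relative reductions on lower-is-better metrics, and the abstract's "three orders of magnitude" is a base-10 ratio of synthesis times. A minimal sketch of that arithmetic follows. The scores and timings are made up for illustration and are not taken from the paper, and the function names are hypothetical.

```python
import math


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (lower-is-better metric)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - ours) / baseline


def orders_of_magnitude(baseline_seconds: float, ours_seconds: float) -> float:
    """Base-10 logarithm of the speedup factor."""
    if baseline_seconds <= 0 or ours_seconds <= 0:
        raise ValueError("timings must be positive")
    return math.log10(baseline_seconds / ours_seconds)


# Hypothetical values for illustration only; NOT results from the paper.
lpips_drop = relative_reduction(baseline=0.20, ours=0.15)
speedup = orders_of_magnitude(baseline_seconds=600.0, ours_seconds=0.6)

print(f"LPIPS reduced by {lpips_drop:.0f}%")
print(f"Synthesis time lowered by {speedup:.1f} orders of magnitude")
```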

Abstract

We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at…
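
Because the representation is metric, a camera translation can be specified directly in meters. The sketch below illustrates this with a plain pinhole projection of Gaussian centers. It is not SHARP's renderer: the focal length, principal point, scene points, and the function name `project_centers` are all assumptions chosen for illustration.

```python
import numpy as np


def project_centers(centers_m, cam_center_m, focal_px, principal_point):
    """Project 3D points (meters) into a pinhole camera with identity rotation.

    The camera sits at `cam_center_m` and looks along +z. This is an
    illustrative stand-in for a Gaussian splatting renderer, not the real one.
    """
    p = np.asarray(centers_m, dtype=float) - np.asarray(cam_center_m, dtype=float)
    z = p[:, 2]
    u = focal_px * p[:, 0] / z + principal_point[0]
    v = focal_px * p[:, 1] / z + principal_point[1]
    return np.stack([u, v], axis=1)


# Hypothetical scene: two Gaussian centers at 2 m and 10 m depth.
centers = np.array([[0.0, 0.0, 2.0],
                    [0.0, 0.0, 10.0]])
focal_px = 1000.0                   # assumed focal length in pixels
principal_point = (640.0, 360.0)    # assumed image center

before = project_centers(centers, [0.0, 0.0, 0.0], focal_px, principal_point)
after = project_centers(centers, [0.1, 0.0, 0.0], focal_px, principal_point)  # move 10 cm right

shift_px = after - before
print(shift_px)
```

Under these assumptions the horizontal shift has magnitude `focal_px * baseline / depth`, so the nearer center moves more in the image than the distant one. A representation without absolute scale could not tie a 10 cm camera move to a specific pixel shift.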

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The task is promising for VR/AR applications. This paper focuses on a cutting-edge field.
2. The results are convincing: the method outperforms many video-based methods that require costly inference.

Weaknesses

1. The authors should provide a video in the supplementary material for a clearer comparison with SOTA methods. Since the method outputs a 3DGS representation, it would be more convincing to attach a video showing novel view synthesis results rendered from it.
2. How large an offset range can the model handle? For regions that are not visible in the input image, does the model have the capability to generatively infer the occlusions and scene extensions?
3. More recent works like See3D should be compared. My most concern l

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The proposed method is fast and efficient, while achieving high-quality results.
2. The experimental results demonstrate strong performance across multiple datasets and metrics.
3. The writing is clear and the engineering contributions are solid.

Weaknesses

1. The work reads more like a systems engineering paper than a novel research contribution. The scientific novelty is limited. The authors should better highlight the key innovations.
2. It would be better if the authors provided video results to showcase the real-time rendering capabilities.
3. The font used in the paper seems to be non-standard.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Novel combination of monocular depth inference and 3D Gaussian Splatting with impressive speed and fidelity. Clear architecture and training pipeline; loss design and curriculum are well justified. Extensive comparisons across datasets and perceptual metrics.

Weaknesses

- Novel-view range unclear. The paper does not specify how far target views are from the input. Report actual displacement (e.g., angle, translation) and analyze performance versus view distance.
- View-to-view consistency. Since only one novel view for each scene is reported, temporal stability across continuous camera motion is unknown. Evaluating flickering with continuous multiple novel view renderings for frame-to-frame consistency is desired.
- Multi-view generalization. Can SHARP
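
The displacement requested in the first point could be reported as a rotation angle and a camera-center distance between the input and target poses. A minimal sketch follows, assuming camera-to-world poses given as a 3x3 rotation matrix and a camera center in meters. The function name, the pose convention, and the example pose are assumptions, not taken from the paper.

```python
import numpy as np


def view_displacement(R_src, c_src, R_tgt, c_tgt):
    """Return (rotation angle in degrees, translation in meters) between two views.

    R_* are 3x3 camera-to-world rotation matrices and c_* are camera centers
    in world coordinates (an assumed convention).
    """
    R_rel = np.asarray(R_tgt, dtype=float) @ np.asarray(R_src, dtype=float).T
    # The rotation angle follows from the trace: trace(R) = 1 + 2*cos(angle).
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = float(np.degrees(np.arccos(cos_angle)))
    translation_m = float(np.linalg.norm(np.asarray(c_tgt, dtype=float) - np.asarray(c_src, dtype=float)))
    return angle_deg, translation_m


# Hypothetical target view: rotated 10 degrees about the y axis and moved 0.3 m sideways.
theta = np.radians(10.0)
R_input = np.eye(3)
R_target = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(theta), 0.0, np.cos(theta)]])

angle_deg, translation_m = view_displacement(R_input, [0.0, 0.0, 0.0],
                                             R_target, [0.3, 0.0, 0.0])
print(f"angle: {angle_deg:.2f} deg, translation: {translation_m:.2f} m")
```

Binning a per-view metric by these two quantities would give the performance-versus-view-distance analysis the reviewer describes.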

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis