SketchingReality: From Freehand Scene Sketches To Photorealistic Images

Ahmed Bourouis; Mikhail Bessmeltsev; Yulia Gryaditskaya

arXiv:2602.14648·cs.CV·February 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a novel method for generating photorealistic images from freehand sketches, balancing realism with sketch fidelity by using a modulation-based approach and a new loss function that does not require pixel-aligned ground truth.

Contribution

It presents a new approach that effectively handles true freehand sketches for image generation, overcoming the lack of pixel-aligned ground truth and improving semantic and visual quality.

Findings

01

Outperforms existing methods in semantic alignment with sketches

02

Produces more realistic and high-quality images from freehand sketches

03

Effective training without ground-truth pixel-aligned images

Abstract

Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed 'sketches', yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth,…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The proposed method achieves impressive visual quality in its generated results, which is further supported by strong quantitative performance. 2. This work addresses a key challenge in sketch-based image generation, the inherent abstraction and ambiguity of sketches, which often leads to distortion in the results of existing methods. The proposed modulation network effectively addresses this issue by emphasizing the semantic structure of sketches. Consequently, the generated images achieve a

Weaknesses

1. The authors state that because a sketch can be abstract and ambiguous, their method focuses on extracting its semantic and structural information. In that case, it raises the question of whether an alternative approach, such as performing sketch captioning first and then feeding the resulting text into a standard T2I model (or the baseline methods used in this paper), could be viable. A discussion of, or comparison against, such a two-stage pipeline would be a valuable addition. 2. All experiments are c

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The method handles abstract and deformable sketches for the scene-level sketch-to-photo generation. Attention supervision explicitly ties language tokens to spatial regions and the modulation head is lightweight and only active in early timesteps, which keeps computation modest. It plugs into a standard SD2.1 pipeline without re-architecting. This modularity makes the approach easy to reproduce.
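To make the two mechanisms named in this review concrete, here is a minimal Python sketch. It is not the authors' formulation: this page does not give the loss or the schedule, so the function names, the "attention mass inside the region" objective, and the `early_fraction` parameter are all illustrative assumptions. It only shows the general idea of supervising a token's cross-attention map with a coarse region from the sketch (no pixel-aligned photo needed) and of gating a modulation head to early denoising steps.

```python
# Hypothetical illustration only; the paper's actual loss and schedule
# are not specified on this page.

def attention_supervision_loss(attn, mask, eps=1e-8):
    """Penalize attention mass that falls outside a token's sketch region.

    attn: 2D list of non-negative cross-attention weights for one text token.
    mask: 2D list of 0/1 values marking where that object was sketched.
    Returns a value in [0, 1]: about 0 when all attention lies inside the
    mask, 1 when none of it does.
    """
    total = sum(sum(row) for row in attn)
    inside = sum(a * m
                 for attn_row, mask_row in zip(attn, mask)
                 for a, m in zip(attn_row, mask_row))
    return 1.0 - inside / (total + eps)


def modulation_active(step, num_steps, early_fraction=0.3):
    """Return True while the (assumed) modulation head should run.

    step counts from 0 (noisiest) to num_steps - 1 (cleanest);
    early_fraction is an assumed hyperparameter, not a value from the paper.
    """
    return step < early_fraction * num_steps


# Toy 2x2 attention map for one token, and the region it should attend to.
attn = [[0.5, 0.5],
        [0.0, 0.0]]
mask = [[1, 0],
        [0, 0]]
loss = attention_supervision_loss(attn, mask)
```

In this toy case half of the attention mass lies inside the region, so the loss is close to 0.5; skipping the head on later steps is what would keep the extra computation modest.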

Weaknesses

The paper does not offer a mechanistic explanation or theoretical analysis of why the noise-modulation head works. The evidence is largely empirical (metric tables and ablations) and does not probe the internal reasons. Moreover, the paper does not clearly position the method against mainstream fine-tuning techniques such as LoRA, and the core distinction from other fine-tuning techniques remains under-analyzed. The paper does not clearly articulate the mechanistic between the noise-modulation

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* Both theoretically and empirically, I found the attention supervision loss to be quite elegant and effective. I find this to be among the paper’s stronger contributions and a key step towards translating freehand sketches to semantically coherent images. * As with any good vision paper, I found the ablation experiments and comparisons to existing methods quite thorough and generally compelling. * While I do have issues with the general clarity of the methods overall, I found some of the overvi

Weaknesses

* The paper is at times hard to read because of its structure. For example, the modulation network is introduced before the reader knows what sketch features it uses. I recommend laying out the necessary details of all the ‘ingredients’ of a module before diving into details about the module. * I think the main text needs more details about the user study. I also find 23 participants to be quite a small pool. * While I do find the interleaved discussion of the results in

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications