SketchingReality: From Freehand Scene Sketches To Photorealistic Images
Ahmed Bourouis, Mikhail Bessmeltsev, Yulia Gryaditskaya

TL;DR
This paper introduces a novel method for generating photorealistic images from freehand sketches, balancing realism with sketch fidelity by using a modulation-based approach and a new loss function that does not require pixel-aligned ground truth.
Contribution
It presents a new approach that effectively handles true freehand sketches for image generation, overcoming the lack of pixel-aligned ground truth and improving semantic and visual quality.
Findings
Outperforms existing methods in semantic alignment with sketches
Produces more realistic and high-quality images from freehand sketches
Effective training without ground-truth pixel-aligned images
Abstract
Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed 'sketches', yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth,…
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed method achieves impressive visual quality in its generated results, which is further supported by strong quantitative performance.
2. This work addresses a key challenge in sketch-based image generation, the inherent abstraction and ambiguity of sketches, which often leads to distortion in the results of existing methods. The proposed modulation network effectively addresses this issue by emphasizing the semantic structure of sketches. Consequently, the generated images achieve a
1. The authors state that because a sketch can be abstract and ambiguous, their method focuses on extracting its semantic and structural information. This raises the question of whether an alternative approach, such as performing sketch captioning first and then feeding the resulting text into a standard T2I model (or the baseline methods used in this paper), could be viable. A discussion of, or comparison against, such a two-stage pipeline would be a valuable addition.
2. All experiments are c
The method handles abstract and deformable sketches for scene-level sketch-to-photo generation. Attention supervision explicitly ties language tokens to spatial regions. The modulation head is lightweight and only active in early timesteps, which keeps computation modest. It plugs into a standard SD2.1 pipeline without re-architecting, and this modularity makes the approach easy to reproduce.
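The two mechanisms named in this review can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the "attention mass outside the region" loss form, and the `early_fraction=0.3` cutoff are all assumptions chosen for illustration, and `step` is assumed to be a sampling step index that starts at 0 on the noisiest step.

```python
# Hypothetical illustration only; the paper's actual loss and schedule may differ.

def attention_supervision_loss(attn_maps, region_masks, eps=1e-8):
    """Mean fraction of each token's attention mass that falls outside its region.

    attn_maps:    {token: 2D list of non-negative attention weights}
    region_masks: {token: 2D list of 0/1 flags with the same shape}
    """
    losses = []
    for token, attn in attn_maps.items():
        mask = region_masks[token]
        total = sum(sum(row) for row in attn)
        inside = sum(
            a * m
            for attn_row, mask_row in zip(attn, mask)
            for a, m in zip(attn_row, mask_row)
        )
        # 0 when all attention lies inside the region, close to 1 when none does.
        losses.append(1.0 - inside / (total + eps))
    return sum(losses) / len(losses)


def modulation_active(step, num_steps, early_fraction=0.3):
    """True while the step index lies in the early part of sampling (assumed cutoff)."""
    return step < early_fraction * num_steps


if __name__ == "__main__":
    attn = {"cat": [[0.5, 0.5], [0.0, 0.0]]}
    mask = {"cat": [[1, 0], [0, 0]]}
    print(attention_supervision_loss(attn, mask))
    print([s for s in range(10) if modulation_active(s, 10)])
```

A loss of this shape needs only a per-token region, not a pixel-aligned target image, which is consistent with the training setting described above. Gating the head by step index means the extra computation is paid only on the early steps.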
The paper does not offer a mechanistic explanation or theoretical analysis of why the noise-modulation head works. The evidence is largely empirical (metric tables and ablations) and does not probe the internal reasons. Moreover, it does not clearly position the method against mainstream fine-tuning techniques such as LoRA, and the core distinction from other fine-tuning techniques remains under-analyzed. The paper does not clearly articulate the mechanistic between the noise-modulation
* Both theoretically and empirically, I found the attention supervision loss to be quite elegant and effective. I find this to be among the paper’s stronger contributions and a key step towards translating freehand sketches to semantically coherent images.
* As with any good vision paper, I found the ablation experiments and comparisons to existing methods quite thorough and generally compelling.
* While I do have issues with the general clarity of the methods overall, I found some of the overvi
* The paper is at times hard to read because of its structure. For example, the modulation network is introduced before the reader knows what sketch features it uses. I recommend laying out the necessary details of all the ‘ingredients’ of a module before diving into details about the module.
* I think the main text needs more details about the user study. I also find 23 participants to be quite a small pool.
* While I do find the interleaved discussion of the results in
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications
