Gen-AI Police Sketches with Stable Diffusion
Nicholas Fidalgo, Aaron Contreras, Katherine Harvey, Johnny Ni

TL;DR
This paper explores AI-driven suspect sketching using Stable Diffusion and CLIP, comparing three models and demonstrating that a simple image-to-image approach yields the most structurally accurate sketches.
Contribution
It introduces a novel LoRA fine-tuning method for CLIP within Stable Diffusion, enhancing text-to-sketch alignment in suspect image generation.
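As a rough illustration of what LoRA fine-tuning does to a single attention projection, the sketch below applies the standard low-rank update W' = W + (alpha / r) · B · A to a weight matrix. This is the generic LoRA formulation, not the authors' implementation; the function names, the nested-list matrix representation, and the choice of alpha are all assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists (rows of numbers)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A for a frozen weight matrix W.

    w: frozen weights, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    alpha: scaling hyperparameter (hypothetical value chosen by the caller)
    """
    r = len(a)                      # LoRA rank = number of rows in A
    scale = alpha / r
    delta = matmul(b, a)            # low-rank update, shape (out, in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 2x2 projection with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]                    # shape (1, 2)
B_zero = [[0.0], [0.0]]             # B starts at zero, so the update is a no-op
B_trained = [[1.0], [2.0]]          # made-up values standing in for a trained B

unchanged = lora_effective_weight(W, A, B_zero, alpha=2.0)
adapted = lora_effective_weight(W, A, B_trained, alpha=2.0)
```

Only A and B are trained while W stays frozen, which is why the same recipe can be attached to self-attention layers, cross-attention layers, or both.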
Findings
Model 1 achieved the highest SSIM (0.72) and a PSNR of 25 dB
Fine-tuning both the self- and cross-attention layers gave the best text-to-sketch alignment in the ablation study
Model 1 produced the clearest facial features in sketches
Abstract
This project investigates the use of multimodal AI-driven approaches to automate and enhance suspect sketching. Three pipelines were developed and evaluated: (1) a baseline image-to-image Stable Diffusion model, (2) the same model integrated with a pre-trained CLIP model for text-image alignment, and (3) a novel approach that applies LoRA fine-tuning to the self-attention and cross-attention layers of the CLIP model and integrates it with Stable Diffusion. An ablation study confirmed that fine-tuning both self- and cross-attention layers yielded the best alignment between text descriptions and sketches. Performance testing revealed that Model 1 achieved the highest structural similarity (SSIM) of 0.72 and a peak signal-to-noise ratio (PSNR) of 25 dB, outperforming Model 2 and Model 3. Iterative refinement enhanced perceptual similarity (LPIPS), with Model 3 showing improvement over Model 2…
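To make the PSNR figure concrete, here is a minimal sketch of how the metric is conventionally computed: PSNR = 10 · log10(MAX² / MSE). It assumes grayscale images stored as nested lists with an 8-bit range (max_value = 255); the function name and input format are hypothetical, and the paper's own evaluation code may differ.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    if len(reference) != len(candidate) or any(
        len(r) != len(c) for r, c in zip(reference, candidate)
    ):
        raise ValueError("images must have the same shape")

    squared_error = 0.0
    count = 0
    for ref_row, cand_row in zip(reference, candidate):
        for r, c in zip(ref_row, cand_row):
            squared_error += (r - c) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")

    mse = squared_error / count
    if mse == 0:
        return math.inf             # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 1x2 "images" (made-up pixel values, not data from the paper).
identical = psnr([[10, 20]], [[10, 20]])
worst_case = psnr([[0, 0]], [[255, 255]])
```

Under this definition, a PSNR of 25 dB corresponds to a root-mean-square pixel error of about 5.6% of the dynamic range, since 10^(-25/20) ≈ 0.056.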
Taxonomy
Topics: Face recognition and analysis · Face Recognition and Perception · Generative Adversarial Networks and Image Synthesis
