Stable Diffusion Models are Secretly Good at Visual In-Context Learning

Trevine Oorloff; Vishwanath Sindagi; Wele Gedara Chaminda Bandara; Ali Shafahi; Amin Ghiasi; Charan Prakash; Reza Ardekani

arXiv:2508.09949·cs.CV·August 14, 2025

TL;DR

This paper demonstrates that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning across multiple tasks without additional training, by modifying attention mechanisms to incorporate context.

Contribution

The authors introduce a novel in-place attention re-computation in Stable Diffusion that enables effective visual in-context learning without fine-tuning.

Findings

01. Improves foreground segmentation mIoU by 8.9% on Pascal-5i.

02. Effectively leverages multiple prompts for better task inference.

03. Achieves competitive results across six diverse vision tasks.
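
For readers unfamiliar with the metric in the first finding, mean intersection-over-union (mIoU) for foreground segmentation can be sketched as below. This is a minimal NumPy sketch under stated assumptions: the names `binary_iou` and `mean_iou` are illustrative, masks are treated as binary foreground masks, and scoring an empty-versus-empty pair as 1.0 is a convention chosen here. The paper's exact evaluation protocol on Pascal-5i is not given in this summary and may differ.

```python
import numpy as np

def binary_iou(pred, target):
    """IoU between two binary foreground masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; counted as a perfect match (assumed convention).
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection) / float(union)

def mean_iou(preds, targets):
    """Average the per-image IoU over a collection of mask pairs."""
    return float(np.mean([binary_iou(p, t) for p, t in zip(preds, targets)]))

# Toy example: the first prediction over-segments by one pixel, the second is exact.
preds = [np.array([[1, 1], [0, 0]]), np.array([[0, 1], [0, 1]])]
targets = [np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 1]])]
print(mean_iou(preds, targets))  # 0.75
```

Whether the reported 8.9% is an absolute or relative gain is not stated in this summary, so the sketch only illustrates how the underlying score is formed.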

Abstract

Large language models (LLMs) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) -- the ability to leverage a few sets of example prompts to adapt to various tasks without having to explicitly update the model weights. ICL has recently been explored for computer vision tasks with promising early outcomes. These approaches involve specialized training and/or additional data that complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this repurposed Stable Diffusion…
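
The general idea of letting a query image draw on example prompts through attention can be illustrated with a toy cross-image attention step. The sketch below is not the authors' in-place attention re-computation, whose details are not given in this summary. It assumes, purely for illustration, that queries come from the query image, keys come from each example's input image, values come from each example's target, and multiple examples are concatenated along the token axis. All names (`context_attention`, `q_query`, `k_prompts`, `v_targets`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def context_attention(q_query, k_prompts, v_targets):
    """Toy scaled dot-product attention from query-image tokens to example tokens.

    q_query:   (n_q, d) array of query-image token features (assumed layout).
    k_prompts: list of (n_i, d) arrays, one per example input image.
    v_targets: list of (n_i, d_v) arrays, one per example target.
    """
    # Concatenate the tokens of all examples so several prompts share one softmax.
    k = np.concatenate(k_prompts, axis=0)
    v = np.concatenate(v_targets, axis=0)
    d = q_query.shape[-1]
    scores = q_query @ k.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))                                # 4 query tokens
ks = [rng.normal(size=(6, 8)), rng.normal(size=(6, 8))]    # 2 examples, 6 tokens each
vs = [rng.normal(size=(6, 8)), rng.normal(size=(6, 8))]
out, w = context_attention(q, ks, vs)
print(out.shape, w.shape)  # (4, 8) (4, 12)
```

Concatenating example tokens is one simple way to let additional prompts contribute to the same attention distribution; how the paper actually combines multiple prompts is not described in the text above.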

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 5

Strengths

The paper shows an interesting observation: Stable Diffusion models, like foundational LLMs, are good at in-context learning given subtle design choices and engineering. It showed better in-context learning than IMProv and Visual Prompting, although, in fairness, those methods used weaker VQGAN models, which were not trained on vast data as diffusion models are. One key advantage of the proposed approach, though, is exploiting the emergent properties of diffusion models rather than having to tr

Weaknesses

Although the paper highlights an interesting emergent property of the diffusion model, my main concern is the lack of technical contributions. Diffusion models, in my opinion, inherently outperform VQGANs, as they have been trained on vast datasets and have shown excellent performance across various unsupervised tasks, such as keypoint detection (Hedlin et al., CVPR 2024), classification (Li et al., CVPR 2023), and segmentation (Tian et al., CVPR 2024). Additionally, using distinct key-query and

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- **Originality:** The proposed training-free visual in-context learning method is highly innovative. It proposes a repurposing technique for the self-attention layers of SD. It integrates attention map contrasting, swap-guidance, and AdaIN mechanisms to enhance prediction quality. While these techniques are inspired by existing work, they are effectively incorporated into the overall framework, contributing to the novelty of the approach.
- **Quality:** The primary innovative technique, in-plac

Weaknesses

- **Comparison methods:** This work compares the proposed method against only two existing approaches, which limits the strength of the comparative analysis. Incorporating additional methods (on different tasks) for comparison would enhance the validity and robustness of the results.

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. The proposed method to perform the in-context inference is quite reasonable.
2. The studied topic is quite interesting, and I believe visual in-context learning is a critical issue.
3. The overall demonstration is good.

Weaknesses

My major concern is the experimental section, where I believe many experiments should be added.
1. The compared baseline methods are not enough. The authors only compare with IMProv and MQ-VAE. How about SegGPT, Painter, and LVM? Besides, results without a dedicated retrieval process are required to demonstrate the overall performance of the proposed method. After all, this paper does not target the design of best-demonstration selection. Finally, the selection-based methods
