TL;DR
This paper demonstrates that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning across multiple tasks without additional training, by modifying attention mechanisms to incorporate context.
Contribution
The authors introduce a novel in-place attention re-computation in Stable Diffusion that enables effective visual in-context learning without fine-tuning.
Findings
Improves foreground segmentation mIoU by 8.9% on Pascal-5i.
Effectively leverages multiple prompts for better task inference.
Achieves competitive results across six diverse vision tasks.
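For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score for foreground segmentation can be computed. The function names `binary_iou` and `mean_iou` are illustrative, masks are assumed to be flattened lists of 0/1 values, and the per-image averaging shown here is an assumption; the summary does not state which averaging protocol the paper uses on Pascal-5i.

```python
def binary_iou(pred, target):
    """IoU between two flattened binary masks (sequences of 0/1)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (prediction, ground truth) pairs (per-image averaging assumed)."""
    return sum(binary_iou(p, t) for p, t in mask_pairs) / len(mask_pairs)


# Toy example: one half-correct prediction and one perfect prediction.
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # intersection 1, union 2
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # intersection 2, union 2
]
score = mean_iou(pairs)
```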
Abstract
Large language models (LLMs) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) -- the ability to leverage a few sets of example prompts to adapt to various tasks without having to explicitly update the model weights. ICL has recently been explored for computer vision tasks with promising early outcomes. These approaches involve specialized training and/or additional data that complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this repurposed Stable Diffusion…
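The abstract does not spell out the attention formulation, so the following is only a rough sketch of one way context between a query image and an example prompt could enter an attention computation. It assumes, purely for illustration, that tokens of the query image act as attention queries, tokens of the example prompt image act as keys, and tokens of the prompt's ground-truth output act as values. The function `context_attention` and all argument names are hypothetical, learned projections are omitted, and this should not be read as the authors' actual method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Standard scaled dot-product attention on lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def context_attention(query_img_tokens, prompt_img_tokens, prompt_gt_tokens):
    """Hypothetical context-aware attention (an assumption, not the paper's rule).

    Each query-image token is matched against the example prompt image,
    and the result is read out from the prompt's ground-truth tokens.
    """
    return attention(query_img_tokens, prompt_img_tokens, prompt_gt_tokens)


# Toy usage: the query token matches the first prompt token,
# so the output is pulled toward that token's ground-truth value.
query = [[10.0, 0.0]]
prompt_img = [[10.0, 0.0], [0.0, 10.0]]
prompt_gt = [[1.0], [0.0]]
prediction = context_attention(query, prompt_img, prompt_gt)
```

In this toy setup, a query token that resembles a region of the example image inherits that region's label, which is the intuition behind inferring a task from examples without updating any weights.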
Peer Reviews
Decision·Submitted to ICLR 2025
The paper presents an interesting observation: Stable Diffusion models, like foundational LLMs, are good at in-context learning given subtle design choices and engineering. It shows better in-context learning than IMProv and VisualPrompting, although, in fairness, those methods use weaker VQGAN models, which were not trained on vast data as diffusion models are. One key advantage of the proposed approach, though, is that it exploits the emergent properties of diffusion models rather than having to tr
Although the paper highlights an interesting emergent property of the diffusion model, my main concern is the lack of technical contributions. Diffusion models, in my opinion, inherently outperform VQGANs, as they have been trained on vast datasets and have shown excellent performance across various unsupervised tasks, such as keypoint detection (Hedlin et al., CVPR 2024), classification (Li et al., CVPR 2023), and segmentation (Tian et al., CVPR 2024). Additionally, using distinct key-query and
- **Originality:** The proposed training-free visual in-context learning method is highly innovative. It proposes a repurposing technique for the self-attention layers of SD. It integrates attention map contrasting, swap-guidance, and AdaIN mechanisms to enhance prediction quality. While these techniques are inspired by existing work, they are effectively incorporated into the overall framework, contributing to the novelty of the approach.
- **Quality:** The primary innovative technique, in-plac
- **Comparison methods:** This work compares the proposed method against only two existing approaches, which limits the strength of the comparative analysis. Incorporating additional methods (on different tasks) for comparison would enhance the validity and robustness of the results.
1. The proposed method to perform the in-context inference is quite reasonable.
2. The studied topic is quite interesting, and I believe visual in-context learning is a critical issue.
3. The overall demonstration is good.
My major concern is the experimental section, where I believe many experiments should be added. 1. The compared baseline methods are not enough. The authors only compare with IMProv and MQ-VAE. How about SegGPT, Painter, and LVM? Besides, results without a specific retrieval process are required to demonstrate the overall performance of the proposed method. After all, this paper does not target the design of best-demonstration selection. Finally, the selection-based methods
