Precise Parameter Localization for Textual Generation in Diffusion Models
Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic

TL;DR
This paper shows that less than 1% of a diffusion model's parameters, all located in attention layers, influence the textual content of generated images, and uses this localization for efficient fine-tuning, text editing, and safety improvements across several architectures.
Contribution
The authors propose a method to localize and target specific attention layers responsible for textual content in diffusion models, enabling efficient fine-tuning, editing, and safety applications.
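To make the idea of targeting a small set of layers concrete, here is a minimal sketch of how the parameter share of a chosen layer subset could be computed. The layer names and parameter counts are hypothetical placeholders, not values from the paper or from any real model.

```python
# Hypothetical layer names and parameter counts (illustration only;
# these are NOT taken from SD3, SDXL, or any other real model).
PARAM_COUNTS = {
    "block0.self_attn": 4_000_000,
    "block0.cross_attn": 300_000,
    "block0.mlp": 8_000_000,
    "block1.self_attn": 4_000_000,
    "block1.cross_attn": 300_000,
    "block1.mlp": 8_000_000,
    "block2.self_attn": 4_000_000,
    "block2.cross_attn": 300_000,
    "block2.mlp": 8_000_000,
}


def parameter_share(param_counts, selected):
    """Return the fraction of all parameters that live in `selected` layers."""
    missing = [name for name in selected if name not in param_counts]
    if missing:
        raise KeyError(f"unknown layers: {missing}")
    total = sum(param_counts.values())
    return sum(param_counts[name] for name in selected) / total


# Assume (hypothetically) that localization picked a single cross-attention layer.
localized = ["block1.cross_attn"]
share = parameter_share(PARAM_COUNTS, localized)
print(f"localized layers hold {share:.2%} of all parameters")
```

With a real model, the counts would come from the model's own parameter tensors rather than a hand-written table.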
Findings
Less than 1% of parameters, all in attention layers, influence text generation
Fine-tuning only the localized layers improves text generation capabilities
Localization enables text editing and safety applications in generated images
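The second finding can be illustrated with a small, pure-Python sketch of a low-rank (LoRA-style) update applied only to a chosen set of layers. It uses the standard LoRA merge rule W + (alpha / r) · B · A; the layer names, the tiny 2×2 weights, and the helper names are assumptions for illustration, and the authors' actual training setup may differ.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_merge(weight, lora_b, lora_a, alpha):
    """Return weight + (alpha / rank) * B @ A, the usual LoRA merge rule."""
    rank = len(lora_a)  # A has shape (rank, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def apply_localized_lora(weights, adapters, localized, alpha=1.0):
    """Merge adapters into the localized layers; copy all other layers unchanged."""
    merged = {}
    for name, weight in weights.items():
        if name in localized and name in adapters:
            lora_b, lora_a = adapters[name]
            merged[name] = lora_merge(weight, lora_b, lora_a, alpha)
        else:
            merged[name] = [row[:] for row in weight]
    return merged


# Hypothetical two-layer "model" with 2x2 identity weights.
weights = {
    "block1.cross_attn": [[1.0, 0.0], [0.0, 1.0]],
    "block1.mlp": [[1.0, 0.0], [0.0, 1.0]],
}
# Rank-1 adapter for the localized layer only. B is non-zero here so the
# effect is visible; in practice B is typically zero-initialized and learned.
adapters = {"block1.cross_attn": ([[1.0], [0.0]], [[0.0, 2.0]])}

merged = apply_localized_lora(weights, adapters, localized={"block1.cross_attn"})
print(merged)
```

Restricting the adapter to the localized layers keeps the number of trainable values small, since only the low-rank factors of those layers would be updated.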
Abstract
Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of the diffusion models' generations. Then, we demonstrate how we can use the localized layers to edit…
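The activation-patching idea mentioned in the abstract can be sketched with a toy model. This is a deliberately simplified illustration, not the authors' implementation: each "layer" is a pair of scalar weights, the prompt conditioning is a (scene, text) pair, and the "text readout" is a single number. In a real diffusion model, the patched quantities would be attention activations computed from the prompt, and the effect would be measured on the text rendered in the generated image. All names and values below are hypothetical.

```python
# Toy layers: (name, weight on scene conditioning, weight on text conditioning).
LAYERS = [
    ("block0.cross_attn", 0.9, 0.0),
    ("block1.cross_attn", 0.1, 1.0),
    ("block2.cross_attn", 0.5, 0.05),
]


def run(layers, cond, patch=None):
    """Run the toy model; `patch` maps layer names to replacement conditioning."""
    patch = patch or {}
    scene, text = 0.0, 0.0
    for name, w_scene, w_text in layers:
        c_scene, c_text = patch.get(name, cond)  # patched layers see the target prompt
        scene += w_scene * c_scene
        text += w_text * c_text
    return scene, text


def patching_effects(layers, source_cond, target_cond):
    """For each layer, the fraction of the source-to-target gap in the text
    readout that is recovered by patching that layer alone."""
    base_text = run(layers, source_cond)[1]
    target_text = run(layers, target_cond)[1]
    gap = target_text - base_text
    effects = {}
    for name, _, _ in layers:
        patched_text = run(layers, source_cond, patch={name: target_cond})[1]
        effects[name] = (patched_text - base_text) / gap if gap else 0.0
    return effects


# Source and target prompts share the scene and differ only in the text to render.
source_cond = (1.0, 1.0)
target_cond = (1.0, 3.0)

effects = patching_effects(LAYERS, source_cond, target_cond)
best_layer = max(effects, key=effects.get)
print(best_layer, effects)
```

Layers whose patching moves the readout most toward the target prompt are the candidates for localization; in this toy setup the effects sum to one only because the model is linear.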
Peer Reviews
Decision: ICLR 2025 Poster
- By careful experimental analysis, this paper localized a small subset of cross and joint attention layers in diffusion models that are responsible for textual content generation.
- Based on this observation, this paper developed an effective fine-tuning strategy for enhanced text generation while maintaining the model’s overall generation performance.
For the application of preventing the generation of toxic text within images, this paper claimed that their approach is able to remove the toxic text in the text prompt from the generated images. They achieved this target by first detecting the toxic text in the text prompt using state-of-the-art large language models, and then replacing the toxic text with non-harmful text for image generation using their approach. However, it seems that directly replacing the toxic text with non-harmful text a
The distribution of image samples with text may be quite different from that of image samples without text. As a result, we may face a trade-off between image quality and text-rendering ability in practice when training a text-to-image generation model. Thus the idea of localizing the layers responsible for text rendering in text-to-image generation models is interesting. If we can localize such layers, then we can use carefully designed fine-tuning so that the resulting model performs well in both ima
The fine-tuning experiment is conducted on a small subset of the MARIO-10M dataset, so it is expected that fine-tuning the whole model may lead to overfitting. The experimental results show that fine-tuning the localized layers indeed works, but they cannot show that fine-tuning only the localized layers is better than fine-tuning the whole model. To illustrate the effects of the localized layers, it is suggested to conduct the experiment on a large-scale dataset, i.e. fine-tuning the localized layers or whole mo
[Originality] The finding about the function of cross-attention layers in generating visual text is quite novel and interesting. [Quality] The authors did rigorous experiments to show that only a few layers are responsible for generating visual texts with multiple experiment setups. Additionally, the authors show that these layers only focus on these visual texts instead of generating other content described by the text prompts. The applications of improving text generation, editing, and prev
I did not find significant weaknesses in the paper. However, the authors could consider the following feedback to improve the readability and clarity of the paper: 1. Although I'm fairly familiar with the architectures of SD3 and SDXL, I still need to guess how localizing by patching is different from injection. So, for SD3, are only the keys and values corresponding to text embeddings swapped by the target prompt caching? 2. There are many papers in text-to-image generation/editing working on
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
