Precise Parameter Localization for Textual Generation in Diffusion Models

Łukasz Staniszewski; Bartosz Cywiński; Franziska Boenisch; Kamil Deja; Adam Dziedzic

arXiv:2502.09935·cs.CV·March 3, 2026

TL;DR

This paper shows that less than 1% of a diffusion model's parameters, all located in attention layers, influence the textual content of generated images, and uses this localization for efficient fine-tuning, text editing, and safety applications across several architectures.

Contribution

The authors propose a method to localize and target specific attention layers responsible for textual content in diffusion models, enabling efficient fine-tuning, editing, and safety applications.

Findings

01

Less than 1% of parameters, all in attention layers, influence text generation

02

Localized fine-tuning improves text generation capabilities

03

Localization enables text editing and safety applications in generated images
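Finding 02 rests on LoRA-based fine-tuning restricted to the localized layers, as the abstract states. The sketch below illustrates the general LoRA idea in plain Python: a frozen weight `W` receives a trainable low-rank update `B @ A`, and adapters are attached only to a chosen subset of layers. The layer names (`block_0.attn`, ...), the sizes, the `alpha` value, and the choice of `block_1.attn` as the "localized" layer are all hypothetical and are not taken from the paper.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # frozen, never updated
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def __call__(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


# Hypothetical setup: three attention layers, only one marked as localized.
d, rank = 8, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
layer_names = ["block_0.attn", "block_1.attn", "block_2.attn"]
localized = {"block_1.attn"}  # assumption for illustration only

# Attach adapters only to the localized layers; all other layers stay frozen.
adapted = {name: LoRALinear(W, rank, alpha=4.0)
           for name in layer_names if name in localized}

x = [float(i) for i in range(d)]
layer = adapted["block_1.attn"]
print(layer.trainable_params(), "trainable vs", d * d, "frozen weights")
```

Each adapter adds `rank * (d_in + d_out)` trainable values, so restricting adapters to a few layers keeps the trainable set small while the rest of the model is left unchanged.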

Abstract

Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting the cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of the diffusion models' generations. Then, we demonstrate how we can use the localized layers to edit…
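The activation patching mentioned in the abstract can be pictured with a toy example. The sketch below is not the authors' implementation: it assumes a stack of single-head cross-attention layers with identity projections, runs the model on a source prompt, swaps in the target prompt's embedding at one layer at a time, and records how far the output moves. The `gains` vector is a hypothetical device that makes only layer 1 contribute, so the expected ranking is known by construction. The Euclidean distance is only a stand-in for whatever measure of text change one would use on real generated images.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Toy single-head cross-attention; keys and values are the context itself."""
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(len(q))])
    return out


def run_model(latents, contexts, gains):
    """contexts[i] is the prompt embedding seen by layer i; gains[i] scales its residual."""
    h = latents
    for ctx, g in zip(contexts, gains):
        attn = cross_attention(h, ctx)
        h = [[x + g * a for x, a in zip(row, arow)] for row, arow in zip(h, attn)]
    return h


def patch_effects(latents, source_ctx, target_ctx, gains):
    """Patch the target prompt into one layer at a time and measure the output shift."""
    n = len(gains)
    base = run_model(latents, [source_ctx] * n, gains)
    effects = []
    for layer in range(n):
        contexts = [source_ctx] * n
        contexts[layer] = target_ctx  # only this layer sees the target prompt
        patched = run_model(latents, contexts, gains)
        effects.append(math.sqrt(sum((a - b) ** 2
                                     for r1, r2 in zip(base, patched)
                                     for a, b in zip(r1, r2))))
    return effects


# Hypothetical toy inputs: two latent tokens, two prompt tokens, three layers.
latents = [[1.0, 0.0], [0.0, 1.0]]
source_ctx = [[1.0, 0.0], [0.0, 1.0]]
target_ctx = [[-1.0, 0.0], [0.0, -1.0]]
gains = [0.0, 1.0, 0.0]  # by construction, only layer 1 affects the output

effects = patch_effects(latents, source_ctx, target_ctx, gains)
most_influential = max(range(len(effects)), key=effects.__getitem__)
print(effects, most_influential)
```

Layers whose patching leaves the output unchanged are treated as not responsible for the patched content; in this toy, that leaves layer 1 as the only localized layer.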

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- By careful experimental analysis, this paper localized a small subset of cross and joint attention layers in diffusion models that are responsible for textual content generation.
- Based on this observation, this paper developed an effective fine-tuning strategy for enhanced text generation while maintaining the model’s overall generation performance.

Weaknesses

For the application of preventing the generation of toxic text within images, this paper claimed that their approach is able to remove the toxic text in the text prompt from the generated images. They achieved this target by first detecting the toxic text in the text prompt using state-of-the-art large language models, and then replacing the toxic text with non-harmful text for image generation using their approach. However, it seems that directly replacing the toxic text with non-harmful text a

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The distribution of image samples with text may be quite different from that of image samples without text. As a result, we may face a trade-off between image quality and text-rendering ability in practice when training a text-to-image generation model. Thus the idea of localizing the layers responsible for text rendering in text-to-image generation models is interesting. If we can localize such layers, then we can use carefully designed fine-tuning so that the resulting model performs well in both ima

Weaknesses

The fine-tuning experiment is conducted on a small subset of the MARIO-10M dataset, so it is expected that fine-tuning the whole model may lead to overfitting. The experimental results show that fine-tuning the localized layers indeed works, but they cannot show that fine-tuning only the localized layers is better than fine-tuning the whole model. To illustrate the effects of the localized layers, it is suggested to conduct the experiment on a large-scale dataset, i.e. fine-tuning the localized layers or whole mo

Reviewer 03 · Rating 8 · Confidence 4

Strengths

[Originality] The finding about the function of cross-attention layers in generating visual text is quite novel and interesting. [Quality] The authors did rigorous experiments to show that only a few layers are responsible for generating visual texts with multiple experiment setups. Additionally, the authors show that these layers only focus on these visual texts instead of generating other content described by the text prompts. The applications of improving text generation, editing, and prev

Weaknesses

I did not find significant weaknesses in the paper. However, the authors could consider the following feedback to improve the readiness and clarity of the paper:
1. Although I'm fairly familiar with the architectures of SD3 and SDXL, I still need to guess how localizing by patching is different from injection. So, for SD3, are only the keys and values corresponding to text embeddings swapped by the target prompt caching?
2. There are many papers in text-to-image generation/editing working on

Videos

Precise Parameter Localization for Textual Generation in Diffusion Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling