Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling

Chao Zhou; Tianyi Wei; Nenghai Yu

arXiv:2507.16240·cs.CV·July 23, 2025

Open Access

TL;DR

This paper introduces Self-Adaptive Attention Scaling (SaaS), a method that improves instruction-following accuracy in unified image generation models by dynamically adjusting attention. The method is validated through experiments on image editing and generation tasks.

Contribution

The paper proposes SaaS, a novel attention scaling technique that improves instruction fidelity in unified models without additional training and addresses the neglect of text instructions.

Findings

1. SaaS improves instruction-following fidelity in image editing and generation tasks.

2. In the reported experiments, SaaS outperforms existing methods.

3. SaaS requires no extra training or optimization.

Abstract

Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose Self-Adaptive Attention…
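The abstract is cut off before it describes how SaaS scales attention, so the following is only a generic sketch of attention re-weighting, not the authors' method. It assumes a single query's attention logits over a set of keys, a hypothetical set of key positions (`boosted_keys`) standing in for the tokens of a neglected sub-instruction, and a fixed `scale` factor in place of whatever self-adaptive rule the paper actually uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boost_attention(logits, boosted_keys, scale):
    """Up-weight attention on selected key positions, then renormalize.

    logits: raw attention scores of one query over all keys.
    boosted_keys: key indices to emphasize (hypothetical stand-in for the
        tokens of a neglected sub-instruction).
    scale: fixed positive multiplier; an assumption for illustration,
        not the paper's self-adaptive rule.
    """
    weights = softmax(logits)
    scaled = [w * scale if i in boosted_keys else w
              for i, w in enumerate(weights)]
    total = sum(scaled)
    # Renormalize so the weights still sum to 1 after scaling.
    return [w / total for w in scaled]


if __name__ == "__main__":
    logits = [2.0, 0.5, 0.5, 1.0]
    print([round(w, 3) for w in softmax(logits)])
    print([round(w, 3) for w in boost_attention(logits, {1, 2}, 3.0)])
```

Multiplying post-softmax weights by `scale` and renormalizing is equivalent to adding `log(scale)` to the selected logits before the softmax, which is one common way such re-weighting is applied at inference time without retraining.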

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Media and Philosophy