Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

Binglei Li; Mengping Yang; Zhiyu Tan; Junping Zhang; Hao Li

arXiv:2602.13585 · cs.CV · March 2, 2026

Open Access

TL;DR

Diff-Aid introduces an inference-time adaptive method for text-to-image diffusion models that enhances semantic alignment and visual quality by dynamically adjusting interactions between textual and visual features across model stages.

Contribution

It proposes a flexible, plug-and-play inference-time approach that adaptively modulates text-image interactions, improving generation quality and interpretability in diffusion models.

Findings

01. Consistent improvements in prompt adherence and visual quality.

02. Enhanced human preference scores across experiments.

03. Effective integration with downstream applications like style transfer and zero-shot editing.

Abstract

Recent text-to-image (T2I) diffusion models have advanced remarkably, yet faithfully following complex textual descriptions remains challenging due to insufficient interactions between textual and visual features. Prior approaches enhance such interactions via architectural design or handcrafted textual condition weighting, but they lack flexibility and overlook how these interactions vary across blocks and denoising stages. To provide a more flexible and efficient solution, we propose Diff-Aid, a lightweight inference-time method that adaptively adjusts per-token text-image interactions across transformer blocks and denoising timesteps. Beyond improving generation quality, Diff-Aid yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising. As a…
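To make the idea of per-token modulation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function `modulated_attention`, the table `ALPHA`, and the choice to apply the coefficient as an additive `log(alpha)` bias on attention logits are all assumptions made for illustration. How Diff-Aid actually computes and applies its coefficients is not described in the text shown here. The sketch only shows what it could mean for one image query to reweight its attention over text tokens, with a different weight vector per (block, timestep) pair.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(image_query, text_keys, text_values, alpha):
    """One image query attends over text tokens.

    alpha[i] > 0 is a hypothetical per-token coefficient. Adding log(alpha[i])
    to the logit multiplies token i's unnormalized attention weight by alpha[i];
    alpha[i] == 1 for all i recovers standard scaled dot-product attention.
    """
    d = len(image_query)
    logits = [
        sum(q * k for q, k in zip(image_query, key)) / math.sqrt(d) + math.log(a)
        for key, a in zip(text_keys, alpha)
    ]
    weights = softmax(logits)
    dim = len(text_values[0])
    output = [sum(w * v[j] for w, v in zip(weights, text_values)) for j in range(dim)]
    return weights, output


# Hypothetical hand-set coefficients indexed by (block index, timestep).
# In this toy table, token 0 is boosted at block 0, timestep 900 only.
ALPHA = {
    (0, 900): [3.0, 1.0],
    (0, 100): [1.0, 1.0],
}

image_query = [1.0, 0.0]
text_keys = [[1.0, 0.0], [1.0, 0.0]]    # two text tokens with identical keys
text_values = [[1.0, 0.0], [0.0, 1.0]]

plain_weights, _ = modulated_attention(image_query, text_keys, text_values, ALPHA[(0, 100)])
boosted_weights, boosted_out = modulated_attention(image_query, text_keys, text_values, ALPHA[(0, 900)])
```

With identical keys, the unmodulated weights are uniform, and a coefficient of 3 on the first token shifts its share from one half to three quarters. A per-block, per-timestep table like `ALPHA` is one simple way to represent modulation that varies across the network and the denoising schedule.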

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Digital Humanities and Scholarship