Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation
Jungwon Park, Jungmin Ko, Dongnam Byun, Wonjong Rhee

TL;DR
This paper proposes a method to improve visual interpretability of text-to-image models by selectively aggregating attention maps from relevant heads, leading to better segmentation and diagnosis of prompt issues.
Contribution
It introduces a head selection approach that aggregates cross-attention maps only from the heads most relevant to a target concept, improving visual interpretability and pointing toward better controllability in diffusion-based T2I models.
Findings
Selective aggregation yields higher mean IoU scores than the diffusion-based segmentation method DAAM.
The most relevant attention heads capture concept-specific features more accurately than the least relevant ones.
Selective aggregation aids in diagnosing prompt misinterpretations.
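For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how a mean IoU score can be computed from binary masks. It is illustrative only: the function names `iou` and `mean_iou`, the nested-list mask format, and the empty-mask convention are assumptions, and this summary does not state whether the paper averages per class or per image.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a non-empty list of (pred, target) mask pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


# Toy masks, not data from the paper.
pred_a = [[1, 1], [0, 0]]
true_a = [[1, 0], [0, 0]]
pred_b = [[0, 1], [0, 1]]
true_b = [[0, 1], [0, 1]]
score = mean_iou([(pred_a, true_a), (pred_b, true_b)])
```

A higher score means the predicted masks overlap more closely with the reference masks.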
Abstract
Numerous studies on text-to-image (T2I) generative models have utilized cross-attention maps to boost application performance and interpret model behavior. However, the distinct characteristics of attention maps from different attention heads remain relatively underexplored. In this study, we show that selectively aggregating cross-attention maps from heads most relevant to a target concept can improve visual interpretability. Compared to the diffusion-based segmentation method DAAM, our approach achieves higher mean IoU scores. We also find that the most relevant heads capture concept-specific features more accurately than the least relevant ones, and that selective aggregation helps diagnose prompt misinterpretations. These findings suggest that attention head selection offers a promising direction for improving the interpretability and controllability of T2I generation.
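The idea of selective aggregation described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation: the abstract does not say how head relevance is computed, so the sketch assumes per-head relevance scores are already available, and the helper names (`select_top_heads`, `aggregate_maps`, `binarize`), the use of a simple mean, and the threshold value are all hypothetical.

```python
def select_top_heads(relevance, k):
    """Return indices of the k heads with the highest relevance score.

    `relevance` is assumed to be precomputed, one score per head.
    """
    order = sorted(range(len(relevance)), key=lambda h: relevance[h], reverse=True)
    return order[:k]


def aggregate_maps(head_maps, selected):
    """Element-wise mean of the selected heads' attention maps.

    `head_maps[h]` is an H x W nested list for head h; averaging is an
    assumed aggregation rule chosen for simplicity.
    """
    if not selected:
        raise ValueError("at least one head must be selected")
    rows = len(head_maps[selected[0]])
    cols = len(head_maps[selected[0]][0])
    return [
        [sum(head_maps[h][i][j] for h in selected) / len(selected) for j in range(cols)]
        for i in range(rows)
    ]


def binarize(attn_map, threshold):
    """Turn an aggregated map into a 0/1 segmentation mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in attn_map]


# Toy example: three heads with 2 x 2 maps; heads 0 and 2 attend to the left column.
head_maps = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.1, 0.9], [0.2, 0.8]],
    [[0.7, 0.3], [0.6, 0.0]],
]
relevance = [0.8, 0.1, 0.6]  # hypothetical scores for one target concept

selected = select_top_heads(relevance, k=2)
aggregated = aggregate_maps(head_maps, selected)
mask = binarize(aggregated, threshold=0.5)  # -> [[1, 0], [1, 0]]
```

In this toy case the low-relevance head, which attends to the opposite column, is excluded, so the aggregated map stays focused on the region the relevant heads agree on. Averaging over all three heads instead would pull every value toward the middle and blur that signal.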
