TL;DR
This paper enhances training-free diffusion-based semantic segmentation by addressing attention map discrepancies, enabling better utilization of powerful generative models for improved segmentation accuracy.
Contribution
It identifies two gaps that keep existing training-free diffusion segmentors from scaling with stronger diffusion models, and proposes auto aggregation and per-pixel rescaling to improve segmentation performance without additional training.
Findings
Improved segmentation accuracy on standard benchmarks.
Effective integration with generative techniques for broader applicability.
Addresses discrepancies among the cross-attention maps computed across multiple heads and layers of diffusion models.
Abstract
As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified…
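To make the setting concrete, here is a minimal sketch of the naive baseline the abstract alludes to: cross-attention maps from several heads and layers are averaged uniformly, and each pixel is assigned to the text token with the highest fused score. This is not the paper's auto aggregation or per-pixel rescaling. The function names, the array layout `(heads, h, w, tokens)`, and the nearest-neighbour upsampling are all assumptions made for illustration, and random arrays stand in for real attention maps.

```python
import numpy as np

def upsample_nearest(attn, size):
    # attn: (h, w, tokens). Assumes `size` is an integer multiple of h and w.
    h, w, _ = attn.shape
    return attn.repeat(size // h, axis=0).repeat(size // w, axis=1)

def naive_segment(attn_maps, size):
    # attn_maps: list with one array per layer, each shaped (heads, h, w, tokens).
    # Hypothetical layout; real models expose attention in flattened form.
    per_layer = [upsample_nearest(m.mean(axis=0), size) for m in attn_maps]
    fused = np.mean(per_layer, axis=0)  # uniform average over layers
    return fused.argmax(axis=-1)        # (size, size) map of token indices

# Stand-in data: two layers, 8 heads each, 3 candidate class tokens.
rng = np.random.default_rng(0)
maps = [rng.random((8, 16, 16, 3)), rng.random((8, 32, 32, 3))]
labels = naive_segment(maps, 64)
```

Uniform averaging treats every head and layer as equally informative, which is the kind of assumption the first gap in the abstract calls into question.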
