TL;DR
SEGA is a training-free, adaptive attention scaling method for diffusion transformers. It improves image synthesis at resolutions beyond the training range by rescaling attention across RoPE components according to the latent's spatial-frequency content at each denoising step.
Contribution
SEGA introduces a novel, content-aware, frequency-guided attention scaling technique that improves resolution extrapolation without additional training.
Findings
SEGA outperforms existing training-free methods in high-resolution image generation.
It improves both structural coherence and fine-detail fidelity across multiple resolutions.
Experimental results demonstrate consistent performance gains over state-of-the-art baselines.
Abstract
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently…
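The idea of scaling attention per RoPE component, rather than uniformly, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the truncated abstract does not give the actual scaling rule, so `high_freq_ratio`, `component_scales`, and `max_gain` are hypothetical names and choices made up for illustration, and 1D positions stand in for the 2D positions a DiT would use. It only shows the mechanism, where each RoPE frequency pair contributes its own term to the query-key dot product and so can receive its own weight derived from the latent's spectrum.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # One rotation frequency per channel pair, from highest to lowest.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, positions, freqs):
    # Standard rotary embedding on interleaved channel pairs (2i, 2i+1).
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def high_freq_ratio(latent, cutoff=0.25):
    # Assumption: summarize content as the fraction of spectral energy
    # above a radial cutoff (in cycles per sample).
    spec = np.abs(np.fft.fft2(latent)) ** 2
    fy = np.fft.fftfreq(latent.shape[0])[:, None]
    fx = np.fft.fftfreq(latent.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = spec.sum()
    return float(spec[radius > cutoff].sum() / total) if total > 0 else 0.0

def component_scales(freqs, hf_ratio, max_gain=0.3):
    # Hypothetical rule: boost high-frequency RoPE pairs more when the
    # latent carries more high-frequency energy. A uniform, content-agnostic
    # method would return the same scalar for every pair.
    rank = np.linspace(1.0, 0.0, len(freqs))  # 1.0 for the highest frequency
    return 1.0 + max_gain * hf_ratio * rank

def scaled_attention(q, k, v, scales):
    # Weight each RoPE pair's contribution to the logits, then softmax.
    w = np.repeat(scales, 2)  # both channels of a pair share one scale
    logits = (q * w) @ k.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

# Toy usage: recompute the scales from the current latent at a denoising step.
rng = np.random.default_rng(0)
tokens, head_dim = 16, 8
freqs = rope_frequencies(head_dim)
pos = np.arange(tokens, dtype=float)
q = apply_rope(rng.standard_normal((tokens, head_dim)), pos, freqs)
k = apply_rope(rng.standard_normal((tokens, head_dim)), pos, freqs)
v = rng.standard_normal((tokens, head_dim))
latent = rng.standard_normal((4, 4))  # stand-in for one latent channel
scales = component_scales(freqs, high_freq_ratio(latent))
out = scaled_attention(q, k, v, scales)
```

With every scale equal to 1 the function reduces to ordinary softmax attention, so in this sketch the content-dependent weights are the only change at inference time, consistent with the training-free framing.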
