TL;DR
This paper introduces ADSA, a training-free dynamic sparse attention method that improves efficiency and reduces memory usage in autoregressive image generation without sacrificing quality.
Contribution
It proposes a novel dynamic sparse attention mechanism and KV-cache update for more efficient inference in autoregressive image models.
Findings
Reduces GPU memory consumption by approximately 50%
Maintains high generation quality with improved efficiency
Outperforms existing methods in qualitative and quantitative metrics
Abstract
Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, excessively long contexts during inference lead to significant memory overhead caused by KV-cache and computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency…
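The memory overhead the abstract attributes to the KV-cache can be made concrete with a little arithmetic. The sketch below is not from the paper: `kv_cache_bytes` and the model shape in `cfg` are hypothetical, chosen only for illustration. It relies on the standard sizing of a transformer KV-cache (one key and one value vector per token, per head, per layer) and shows that halving the retained context halves the cache, while the model weights, which this function does not count, stay fixed.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Bytes held by cached keys and values (the leading 2 covers K and V)."""
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * bytes_per_elem


# Hypothetical model shape (assumption, not the configuration used in the paper).
cfg = dict(n_layers=24, n_heads=16, head_dim=64)

full = kv_cache_bytes(batch=8, seq_len=1024, **cfg)  # full context retained
half = kv_cache_bytes(batch=8, seq_len=512, **cfg)   # context cut to one half

print(f"full context: {full / 2**20:.0f} MiB")
print(f"half context: {half / 2**20:.0f} MiB")
```

Because the cache grows linearly with batch size while the weights do not, the same relative cut in context length yields a smaller share of total GPU memory at small batch sizes.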
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper addresses an important issue of computational efficiency in autoregressive image generation.
2. It provides insightful analyses of autoregressive generation mechanisms, highlighting the distinct roles of prefix, previous, and local tokens in shaping generated images (Section 3).
3. The proposed ADSA method effectively reduces the required context length while maintaining high-quality image generation.
1. The generality of the proposed ADSA is not fully explored. While its effectiveness is demonstrated on the LlamaGen model, further validation on other major autoregressive image generation models would strengthen the paper.
2. Although the paper presents solid analyses (Section 3) and quantitative results (Section 5), it remains unclear whether the proposed semantic-diversity-based context reduction strategy is optimal. An extended ablation study comparing it to simpler approaches (e.g., un
- **Training-free, architecture-agnostic** drop-in idea with clear intuition (prefix for style, local for texture; Fig. 6).
- **Concrete formulation.** TopK-V selection by average V-similarity (Eqs. 2–5).
- **Reported memory relief.** Up to ~50% shorter cache/context with near-constant FID/IS/CLIP and memory curves across batch sizes (Tables 1–2; Fig. 9).
- **Simple ablation of prefix/local/previous** shows local window is critical (Table 3).
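The cache selection the reviewers describe (keep prefix tokens, keep a local window, and pick a few earlier tokens by value similarity) can be illustrated with a small sketch. This is not the paper's formulation: Eqs. 2–5 are not reproduced on this page, so the scoring rule here (average cosine similarity between a candidate's value vector and the value vectors in the local window, keeping the highest scorers) is an assumption, and `select_cache`, `n_prefix`, `n_local`, and `k` are hypothetical names. Another review calls the strategy "semantic-diversity-based", so the real criterion may rank tokens differently.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_cache(values, n_prefix, n_local, k):
    """Return sorted indices of cached tokens to keep.

    values: one value vector per cached token, in generation order.
    Always keeps the first n_prefix and the last n_local tokens. From the
    tokens in between, keeps the k whose value vectors have the highest
    average cosine similarity to the local-window values (assumed rule).
    """
    n = len(values)
    prefix = set(range(min(n_prefix, n)))
    local = set(range(max(n - n_local, 0), n))
    middle = [i for i in range(n) if i not in prefix and i not in local]
    local_vals = [values[j] for j in sorted(local)]

    def score(i):
        if not local_vals:
            return 0.0
        return sum(cosine(values[i], v) for v in local_vals) / len(local_vals)

    top = sorted(middle, key=score, reverse=True)[:k]
    return sorted(prefix | local | set(top))


# Toy value vectors: token 2 points the same way as the two most recent tokens.
toy_values = [[1, 0], [0, 1], [1, 0.1], [0, 1], [1, 0], [1, 0]]
kept = select_cache(toy_values, n_prefix=1, n_local=2, k=1)
print(kept)
```

In an actual decoder the same index set would be applied to both the key and value tensors of every layer and head; the sketch scores a single head with plain lists to keep the logic visible.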
1) **Novelty vs. prior dynamic/sparse attention is insufficiently isolated.** ADSA overlaps conceptually with Λ-shaped/window + selective cache ideas known in LLMs (e.g., StreamingLLM, LM-Infinite, LongHeads, Reattention, MInference, RetrievalAttention). The paper argues image tokens are high-entropy so NLP methods don’t transfer (Sec. 2), but lacks a *controlled* comparison where those baselines are adapted to LlamaGen and evaluated under the same protocol. Please add head-to-head against Zip
1. This paper is well-organized and easy to read.
2. The approach is very easy to follow.
1. The major concern of this paper is whether the proposed KV cache selection method can achieve obvious inference acceleration.
   - For experiments, this paper only shows GPU memory usage compared with the original full-context-length method. It seems that the GPU memory usage reduction is not obvious when the batch size is small. However, small-batch-size inference is more common in real-world applications. Besides, the inference throughput is more important compared to GPU memory overhead (which is
- The motivation is clearly stated and convincing;
- The general idea is straightforward and reasonable;
Despite the idea being reasonable and straightforward, this work lacks both novelty and the evidence of its effectiveness:
- Insignificant Novelty: The selected K-V pairs have three parts: local tokens, sink tokens, and dynamically selected ones. The former two have been proposed earlier [1] and extensive works have studied similar topics [2,3,4].
- More importantly, the only "new" technique should be the dynamic selection mechanism, because the other two are static and proposed by existing
