Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation

Xunzhi Xiang; Qi Fan

arXiv:2506.18226·cs.CV·June 24, 2025

4 Reviews

TL;DR

This paper introduces ADSA, a training-free dynamic sparse attention method that improves efficiency and reduces memory usage in autoregressive image generation without sacrificing quality.

Contribution

It proposes a novel dynamic sparse attention mechanism and KV-cache update for more efficient inference in autoregressive image models.

Findings

1. Reduces GPU memory consumption by approximately 50%
2. Maintains high generation quality with improved efficiency
3. Outperforms existing methods in qualitative and quantitative metrics

Abstract

Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, excessively long contexts during inference lead to significant memory overhead caused by KV-cache and computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency…
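To make the memory concern concrete, here is a rough sketch of how KV-cache size grows with context length. The function name and the model configuration are hypothetical and are not taken from the paper; the formula is the standard count of one key and one value vector per token, per head, per layer.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size, bytes_per_elem=2):
    """Approximate bytes held by a KV cache (fp16 by default)."""
    # Factor 2: one key vector and one value vector per cached token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size


# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=24, num_kv_heads=16, head_dim=64, batch_size=8)

full = kv_cache_bytes(context_len=1024, **config)
half = kv_cache_bytes(context_len=512, **config)

print(f"full context: {full / 2**20:.0f} MiB")
print(f"half context: {half / 2**20:.0f} MiB")
```

Under these assumptions the cache scales linearly with both the number of retained tokens and the batch size, so halving the retained context halves the cache, and the absolute saving grows with batch size.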

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This paper addresses an important issue of computational efficiency in autoregressive image generation.
2. It provides insightful analyses of autoregressive generation mechanisms, highlighting the distinct roles of prefix, previous, and local tokens in shaping generated images (Section 3).
3. The proposed ADSA method effectively reduces the required context length while maintaining high-quality image generation.

Weaknesses

1. The generality of the proposed ADSA is not fully explored. While its effectiveness is demonstrated on the LlamaGen model, further validation on other major autoregressive image generation models would strengthen the paper.
2. Although the paper presents solid analyses (Section 3) and quantitative results (Section 5), it remains unclear whether the proposed semantic-diversity-based context reduction strategy is optimal. An extended ablation study comparing it to simpler approaches (e.g., un

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- **Training-free, architecture-agnostic** drop-in idea with clear intuition (prefix for style, local for texture; Fig. 6).
- **Concrete formulation.** TopK-V selection by average V-similarity (Eqs. 2–5).
- **Reported memory relief.** Up to ~50% shorter cache/context with near-constant FID/IS/CLIP and memory curves across batch sizes (Tables 1–2; Fig. 9).
- **Simple ablation of prefix/local/previous** shows local window is critical (Table 3).
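As an illustration of the kind of selection described in the points above, the following is a minimal sketch that keeps a fixed prefix, a local window, and a top-k subset of the tokens in between, ranked by the average cosine similarity of their value vectors. It is not the paper's Eqs. 2–5, which are not reproduced on this page. Every name here is hypothetical, and the ranking direction (keeping the least similar tokens by default) is an assumption exposed as a flag.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_kept_positions(values, prefix_len, local_window, top_k,
                          keep_most_similar=False):
    """Return sorted cache positions to keep: prefix + top-k middle + local."""
    n = len(values)
    prefix_len = min(prefix_len, n)
    local_start = max(prefix_len, n - local_window)
    prefix = list(range(prefix_len))
    local = list(range(local_start, n))
    middle = list(range(prefix_len, local_start))

    def avg_similarity(i):
        # Average similarity of token i's value vector to the other candidates.
        others = [j for j in middle if j != i]
        if not others:
            return 0.0
        return sum(cosine(values[i], values[j]) for j in others) / len(others)

    # Assumption: by default, keep the tokens least similar to the rest.
    ranked = sorted(middle, key=avg_similarity, reverse=keep_most_similar)
    dynamic = sorted(ranked[:top_k])
    return prefix + dynamic + local


# Toy value vectors: position 4 is the only distinct one in the middle span.
values = [[1, 1], [1, 0], [1, 0], [1, 0], [0, 1], [1, 0], [1, 1], [1, 1]]
kept = select_kept_positions(values, prefix_len=1, local_window=2, top_k=1)
print(kept)
```

In this toy case the kept set is the single prefix token, the one middle token whose value vector differs from the others, and the last two tokens. A real implementation would operate per head on batched tensors rather than Python lists.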

Weaknesses

1) **Novelty vs. prior dynamic/sparse attention is insufficiently isolated.** ADSA overlaps conceptually with Λ-shaped/window + selective cache ideas known in LLMs (e.g., StreamingLLM, LM-Infinite, LongHeads, Reattention, MInference, RetrievalAttention). The paper argues image tokens are high-entropy so NLP methods don’t transfer (Sec. 2), but lacks a *controlled* comparison where those baselines are adapted to LlamaGen and evaluated under the same protocol. Please add head-to-head against Zip

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. This paper is well-organized and easy to read.
2. The approach is very easy to follow.

Weaknesses

1. The major concern of this paper is whether the proposed KV cache selection method can achieve obvious inference acceleration.
   - For experiments, this paper only shows GPU memory usage compared with original entire context length methods. It seems that the GPU memory usage reduction is not obvious when batch size is small. However, small batch size inference is more common in real-world applications. Besides, the inference throughput is more important compared to GPU memory overhead (which is

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- The motivation is clearly stated and convincing.
- The general idea is straightforward and reasonable.

Weaknesses

Despite the idea being reasonable and straightforward, this work lacks both novelty and the evidence of its effectiveness:
- Insignificant Novelty: The selected K-V pairs have three parts: local tokens, sink tokens, and dynamically selected ones. The former two have been proposed earlier [1] and extensive works have studied similar topics [2,3,4].
- More importantly, the only "new" technique should be the dynamic selection mechanism, because the other two are static and proposed by existing
