SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation

Siyong Jian; Huan Wang

arXiv:2510.18716·cs.CV·October 22, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a novel KV cache compression framework for autoregressive image generation that decouples attention heads based on spatial and semantic properties, significantly reducing memory and computation while maintaining image quality.

Contribution

It proposes a new attention head decoupling method leveraging spatial locality and semantic sink phenomena, enabling efficient autoregressive image generation.
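One way to picture head decoupling is a simple locality test on a head's attention weights. The sketch below is illustrative only and is not the paper's procedure: the helper names `locality_ratio` and `classify_head`, the `window` size, and the 0.5 `threshold` are all assumptions introduced here.

```python
from typing import List


def locality_ratio(attn_row: List[float], window: int) -> float:
    """Fraction of a query's attention mass that falls on the last `window` keys."""
    total = sum(attn_row)
    if total == 0 or window <= 0:
        return 0.0
    return sum(attn_row[-window:]) / total


def classify_head(attn_row: List[float], window: int, threshold: float = 0.5) -> str:
    """Label a head 'spatial' if most attention mass is recent, else 'semantic'.

    Hypothetical heuristic: the threshold and the single-row input are
    simplifications chosen for illustration, not values from the paper.
    """
    if locality_ratio(attn_row, window) >= threshold:
        return "spatial"
    return "semantic"


if __name__ == "__main__":
    recent_heavy = [0.05, 0.05, 0.1, 0.8]   # mass concentrated on recent tokens
    sink_heavy = [0.7, 0.1, 0.1, 0.1]       # mass concentrated on an early token
    print(classify_head(recent_heavy, window=2))
    print(classify_head(sink_heavy, window=2))
```

In practice such a test would presumably be averaged over many queries and prompts rather than applied to a single attention row.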

Findings

1. Achieves 5× memory reduction
2. Realizes 6.6× throughput speedup
3. Maintains high image quality with minimal loss

Abstract

Autoregressive image generation models like Janus-Pro produce high-quality images, but at the significant cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it still remains largely unexplored for the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens. Our extensive…
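The two retention rules described in the abstract can be sketched as per-head index selection. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `window` and `budget` parameters, and the use of accumulated attention scores as the "highly-attended" criterion are all hypothetical choices made here for illustration.

```python
from typing import Dict, List


def keep_indices_spatial(num_tokens: int, window: int) -> List[int]:
    """Spatial-locality head: keep only the most recent `window` positions."""
    start = max(0, num_tokens - window)
    return list(range(start, num_tokens))


def keep_indices_semantic(attn_scores: List[float], budget: int) -> List[int]:
    """Semantic-sink head: keep the `budget` positions with the highest scores.

    `attn_scores` is assumed to hold one accumulated attention score per
    cached token (an assumption; the paper may score tokens differently).
    """
    ranked = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:budget])  # return in positional order


def compress_kv(
    head_types: Dict[int, str],
    attn_scores: Dict[int, List[float]],
    num_tokens: int,
    window: int,
    budget: int,
) -> Dict[int, List[int]]:
    """Return, for each head, the cache positions that would be retained."""
    kept: Dict[int, List[int]] = {}
    for head, kind in head_types.items():
        if kind == "spatial":
            kept[head] = keep_indices_spatial(num_tokens, window)
        else:
            kept[head] = keep_indices_semantic(attn_scores[head], budget)
    return kept


if __name__ == "__main__":
    types = {0: "spatial", 1: "semantic"}
    scores = {1: [0.1, 0.9, 0.05, 0.7, 0.2]}
    print(compress_kv(types, scores, num_tokens=5, window=2, budget=2))
```

A real cache would then gather the key and value tensors at the retained positions for each head; that step is omitted because it depends on the model's cache layout.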

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Originality: The paper's primary strength is its originality. Instead of merely adapting language-based KV compression, it presents a new, empirically grounded understanding of attention mechanisms in visual AR models. The identification of the "spatial-semantic dichotomy" and the "margin column anchoring" phenomenon is a novel and significant finding.
- Quality: The work is of good quality, with strong empirical validation for its claims. Notably, Figure 2(b) provides exceptionally clear and

Weaknesses

- Generalizability: The paper's primary weakness is the limited scope of its validation. All analyses and experiments are conducted exclusively on the Janus-Pro model family. It remains unclear whether the core findings (the spatial-semantic dichotomy and margin column anchoring) are fundamental properties of visual AR generation or emergent properties specific to the Janus-Pro architecture. The claim needs to be validated on other visual AR models to be considered general.
- Static Head Classifi

Reviewer 02 · Rating 2 · Confidence 5

Strengths

1. The proposed methods are simple and effective.
2. The paper is easy to follow.

Weaknesses

1. The experimental evaluation is conducted exclusively using Janus-Pro models. To fully establish the robustness and general applicability of the proposed methods, validation across a broader range of model architectures is necessary.
2. The concept of exploiting spatial locality to accelerate autoregressive (AR) image generation has been widely adopted in methods such as PAR [1], ZipAR [2], and NAR [3]. These works, which also employ parallel decoding by restricting the attention window, are h

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper is well-written, intuitive, and easy to understand.
- This is the first work to analyze the characteristics of the KV cache in AR image models, and it finds interesting attention patterns (spatial and semantic). This observation aligns well with intuition.
- The paper also proposes intuitive KV cache compression methods tailored to the two distinct attention types.

Weaknesses

- **Limited Generalizability**: All experiments were conducted solely on the Janus-Pro model. It is uncertain whether the paper's findings, including the observed attention patterns and the efficacy of the proposed compression methods, generalize to other AR image generation models. Experiments on other AR image models, such as LlamaGen, Emu3, Anole, and Lumina-mGPT (1, 2), are necessary. I believe experiments on LlamaGen are essential, and additional validation on Lumina-mGPT would be welcome.

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Image Enhancement Techniques