ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization
Jiayu Chen, Ruoyu Lin, Zihao Zheng, Jingxin Li, Maoliang Li, Guojie Luo, Xiang Chen

TL;DR
ToProVAR introduces a novel entropy-aware and sparsity-based optimization framework for visual autoregressive models, significantly improving generation efficiency while maintaining semantic quality.
Contribution
It proposes a new optimization method leveraging attention entropy and sparsity patterns, differing from heuristic skipping strategies in prior models.
Findings
Achieves up to 3.4x acceleration in generation speed.
Maintains semantic fidelity and detail with minimal quality loss.
Outperforms traditional methods in efficiency and quality.
Abstract
Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the…
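To make the central quantity concrete, the following is a minimal sketch of per-token attention entropy and a low-entropy ratio, written in NumPy. It is not the authors' implementation: the function names `attention_entropy` and `low_entropy_ratio`, the single-head setup, and the fixed threshold are all assumptions for illustration, and the excerpt does not specify how the paper aggregates entropy across heads or chooses its threshold.

```python
import numpy as np

def attention_entropy(q, k):
    """Shannon entropy (in nats) of each query's softmax attention distribution.

    q: (n_q, d) query vectors, k: (n_k, d) key vectors (single head, assumed).
    Returns an array of shape (n_q,).
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    # Small epsilon guards against log(0) for near-zero probabilities.
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def low_entropy_ratio(entropy, threshold):
    """Fraction of tokens whose entropy falls below `threshold` (hypothetical helper)."""
    return float((np.asarray(entropy) < threshold).mean())

# A query with zero logits attends uniformly (maximum entropy, log n_k);
# a query strongly aligned with one key attends almost one-hot (entropy near 0).
keys = np.eye(4)
uniform_q = np.zeros((1, 4))
peaked_q = 50.0 * np.eye(4)[:1]
h_uniform = attention_entropy(uniform_q, keys)[0]
h_peaked = attention_entropy(peaked_q, keys)[0]
```

Under this reading, a token with focused (low-entropy) attention is treated as semantically salient, and the share of such tokens gives a rough measure of how much of a scale can be pruned. This sketch materializes the full attention matrix and is only meant to show the definition.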
Peer Reviews
Decision: ICLR 2026 Poster
* Integrated semantic perspective. The tri-dimensional entropy viewpoint connects token salience, layer scope, and multi-scale semantics, which motivates where to prune.
* Layer taxonomy with a quantitative analysis. SVD-based analysis gives a reproducible criterion to distinguish Global/Detail layers before pruning.
* Scale-aware depth selection. The low-entropy ratio offers a principled way to adapt depth to content complexity rather than using a fixed scale budget.
* Empirical validation. The layer
I'm not an actual expert in this area, but I have some concerns about the paper (including parts of the appendix) based on my understanding.
* In terms of overhead and practicality, computing entropy per token, SVD per layer × scale, and tri-dimensional gating online can be expensive. What is the actual (i.e., net) wall-clock speedup vs. simpler frequency-based methods once analysis overhead is included?
* Calibration burden: Depth threshold $\tau$ is selected by pre-sampling (in the Appendix). How robust
$\bullet$ New sparsity signal: moves from frequency heuristics to entropy-based semantic salience, and the results with illustrations are convincing.
$\bullet$ The three-dimensional design is well structured: starting with the low-entropy ratio, then layer classification via the principal component ratio from SVD, then a unified token retention probability.
$\bullet$ Nontrivial novelty in the engineering design: FAE integrates entropy into flash attention to avoid $N \times N$ instantiation and keep the linear-t
$\bullet$ The paper takes low attention entropy as a sign of salience, but there is no proof that it correlates with task loss or semantics, nor any normalization across heads and layers with different logit temperatures.
$\bullet$ No error bounds or optimality results are derived for the tri-stage greedy pruning: the scale -> layer -> token decisions are locally heuristic, with no suboptimality gap, stability, or compounding-error control. I understand this work focuses on empirical contributions, but it is often necessary
1. The idea of employing attention entropy is interesting and the motivation is strong.
2. Solid experiments and promising performance.
1. It would be beneficial to include results on additional backbones beyond Infinity, such as HART [1] and STAR [2], to demonstrate the generalizability of ToProVAR across different VAR architectures.
2. The paper does not provide a detailed analysis of the computational overhead of SVD. Is there an analysis of the layer representation score? What are the patterns in the emergence of Global Layers and Detail Layers? Figure 10 shows an alternating pattern; is this behavior general?
3. According
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications
