QuickMerge++: Fast Token Merging with Autoregressive Prior
Dong Liu, Yanxuan Yu

TL;DR
QuickMerge++ is a token merging method that dynamically reduces token counts during autoregressive generation, lowering token-level compute while maintaining or improving prediction accuracy across multiple modalities.
Contribution
It introduces a dynamic, attention-based token merging framework with an autoregressive prior, enabling efficient generation without relying on static or modality-specific token selection.
Findings
Reduces token counts substantially across domains.
Matches or exceeds performance of learned tokenizers.
Improves compute-accuracy tradeoffs in multi-modality tasks.
Abstract
As generative models scale to larger inputs across language, vision, and video domains, the cost of token-level computation has become a key bottleneck. While prior work suggests that only a subset of tokens significantly influence downstream predictions, most token selection methods are static, modality-specific, or incompatible with autoregressive generation. In this paper, we propose QuickMerge, a lightweight token merging framework designed for efficient next-token prediction. QuickMerge dynamically selects a reduced number of tokens based on attention norm magnitude, guided by an entropy-based budget estimator. To preserve autoregressive compatibility, we introduce a lightweight transformer prior trained over the merged token sequence. By combining semantic salience estimation, flexible token budgets, and AR alignment, QuickMerge enables accurate generation with fewer tokens.…
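The selection step described in the abstract (saliency-based token reduction under an entropy-derived budget) can be illustrated with a small sketch. This is not the authors' implementation: the function names `entropy_budget` and `merge_tokens`, the use of exp(entropy) as the budget, and the rule that merges each dropped token into its nearest kept neighbour are all assumptions made for illustration. The `scores` input stands in for per-token attention norm magnitudes, and the lightweight transformer prior is not modelled here.

```python
import math

def entropy_budget(scores, min_keep=1):
    """Assumed budget rule: exp(Shannon entropy) of the normalized scores,
    i.e. the 'effective number' of salient tokens, clipped to [min_keep, n]."""
    total = sum(scores)  # scores must be positive
    probs = [s / total for s in scores]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return max(min_keep, min(len(scores), round(math.exp(h))))

def merge_tokens(tokens, scores):
    """tokens: list of equal-length float vectors.
    scores: positive saliency per token (stand-in for attention norms).
    Returns (merged_vectors, kept_indices) with sequence order preserved."""
    k = entropy_budget(scores)
    # Keep the k highest-scoring positions, restored to sequence order.
    ranked = sorted(range(len(tokens)), key=lambda i: -scores[i])
    keep = sorted(ranked[:k])
    # Hypothetical merge rule: assign every token to its nearest kept position.
    groups = {i: [] for i in keep}
    for j in range(len(tokens)):
        nearest = min(keep, key=lambda i: abs(i - j))
        groups[nearest].append(j)
    # Each merged token is the score-weighted mean of its group.
    merged = []
    for i in keep:
        weight = sum(scores[j] for j in groups[i])
        dim = len(tokens[i])
        merged.append([
            sum(scores[j] * tokens[j][d] for j in groups[i]) / weight
            for d in range(dim)
        ])
    return merged, keep
```

With uniform scores the budget equals the sequence length and nothing is merged; when one token dominates, the budget collapses toward a single merged token. A real system would derive `scores` from the model's attention activations rather than take them as input.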
