Importance-Based Token Merging for Efficient Image and Video Generation
Haoyu Wu, Jingyi Xu, Hieu Le, Dimitris Samaras

TL;DR
This paper introduces an importance-based token merging technique that enhances the efficiency and quality of image and video generation by focusing on preserving high-information tokens during the merging process.
Contribution
The paper proposes a novel importance-based token merging method that prioritizes critical tokens, significantly improving generation quality and efficiency across multiple vision tasks.
Findings
Outperforms baseline methods in various applications
Enhances detail and realism in generated images and videos
Works with multiple model architectures
Abstract
Token merging can effectively accelerate various vision systems by processing groups of similar tokens only once and sharing the results across them. However, existing token grouping methods are often ad hoc and random, disregarding the actual content of the samples. We show that preserving high-information tokens during merging (those essential for semantic fidelity and structural details) significantly improves sample quality, producing finer details and more coherent, realistic generations. Despite being simple and intuitive, this approach remains underexplored. To this end, we propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation, leveraging readily available importance scores, such as those from classifier-free guidance in diffusion models. Experiments show that our approach significantly outperforms…
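The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation; the abstract does not give the algorithm, so several choices here are assumptions: the importance score is taken to be the per-token magnitude of the gap between the conditional and unconditional predictions under classifier-free guidance, each low-importance token is grouped with its most similar kept token by cosine similarity, and all function names (`importance_from_cfg`, `build_merge_plan`, `merged_apply`) are hypothetical.

```python
import math


def importance_from_cfg(eps_cond, eps_uncond):
    # ASSUMPTION: importance = L2 norm of (conditional - unconditional)
    # prediction per token. The abstract only says scores come "from
    # classifier-free guidance", not how they are computed.
    return [
        math.sqrt(sum((c - u) ** 2 for c, u in zip(ec, eu)))
        for ec, eu in zip(eps_cond, eps_uncond)
    ]


def cosine(a, b):
    # Cosine similarity between two token vectors (0.0 if either is zero).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def build_merge_plan(tokens, scores, num_keep):
    # Keep the num_keep highest-scoring tokens; every other token is
    # assigned to the kept token it is most similar to.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:num_keep])
    kept_set = set(kept)
    assignment = {}
    for i in range(len(tokens)):
        if i in kept_set:
            assignment[i] = i
        else:
            assignment[i] = max(kept, key=lambda k: cosine(tokens[i], tokens[k]))
    return kept, assignment


def merged_apply(tokens, kept, assignment, layer):
    # Run the (expensive) layer once per kept token, then share each
    # result with all tokens in that group ("unmerge" by copying).
    outputs = {k: layer(tokens[k]) for k in kept}
    return [outputs[assignment[i]] for i in range(len(tokens))]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    eps_cond = [[0.5, 0.0], [0.1, 0.0], [0.0, 0.1], [0.0, 0.6]]
    eps_uncond = [[0.0, 0.0] for _ in tokens]

    scores = importance_from_cfg(eps_cond, eps_uncond)
    kept, assignment = build_merge_plan(tokens, scores, num_keep=2)
    out = merged_apply(tokens, kept, assignment, lambda t: [2 * x for x in t])
    print(kept, assignment, out)
```

With `num_keep=2` the layer runs on two of the four tokens, and the two low-importance tokens reuse the output of their nearest kept neighbour. A random or fixed-stride choice of kept tokens would use the same `merged_apply` step; only the selection in `build_merge_plan` would differ.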
Taxonomy
Topics: Access Control and Trust
Methods: Diffusion · High-Order Consensuses
