ApET: Approximation-Error Guided Token Compression for Efficient VLMs

Qiankun Ma; Ziyao Zhang; Haofei Wang; Jie Chen; Zhen Song; Hairong Zheng

arXiv:2602.19870·cs.CV·February 24, 2026

TL;DR

ApET is an information-theoretic, attention-free token compression method for Vision-Language Models. It substantially reduces the number of visual tokens while preserving performance, and because it needs no attention scores it remains compatible with efficient attention kernels such as FlashAttention.

Contribution

The paper proposes an attention-independent token compression framework guided by approximation error, improving inference efficiency with minimal loss in accuracy.

Findings

01

Retains 95.2% of original performance on image tasks

02

Achieves 100.4% of original performance on video tasks

03

Compresses token budgets by over 87%

Abstract

Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the…
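
The abstract is truncated here, so the following is only a rough illustration of the general idea of ranking visual tokens by approximation error rather than by attention scores. It is not the paper's algorithm. Everything in it is an assumption made for illustration: the hypothetical function name `select_tokens`, the randomly chosen basis tokens, the least-squares reconstruction, and the rule of keeping the basis plus the worst-reconstructed tokens.

```python
import numpy as np


def select_tokens(tokens: np.ndarray, num_basis: int, budget: int, seed: int = 0) -> np.ndarray:
    """Return sorted indices of `budget` tokens to keep (illustrative sketch only).

    tokens: (n, d) array of visual token embeddings.
    Assumed scheme, not taken from the paper: reconstruct every token as a
    linear combination of a few basis tokens, then keep the basis tokens plus
    the tokens that the basis approximates worst.
    """
    n, _ = tokens.shape
    if not (0 < num_basis <= budget <= n):
        raise ValueError("need 0 < num_basis <= budget <= number of tokens")

    rng = np.random.default_rng(seed)
    basis_idx = rng.choice(n, size=num_basis, replace=False)
    basis = tokens[basis_idx]  # (num_basis, d)

    # Least-squares coefficients so that basis.T @ coeffs approximates tokens.T.
    coeffs, *_ = np.linalg.lstsq(basis.T, tokens.T, rcond=None)
    recon = (basis.T @ coeffs).T  # (n, d)

    # Per-token approximation error: large error means the token carries
    # information that the basis cannot express.
    err = np.linalg.norm(tokens - recon, axis=1)
    err[basis_idx] = -np.inf  # basis tokens are kept unconditionally

    extra = np.argsort(-err)[: budget - num_basis]
    return np.sort(np.concatenate([basis_idx, extra]))


if __name__ == "__main__":
    demo = np.random.default_rng(0).normal(size=(64, 32))
    print(select_tokens(demo, num_basis=4, budget=8))
```

Because the score depends only on the token embeddings, a scheme of this kind never needs to read an attention map, which is the property the abstract highlights as making attention-free compression usable alongside fused kernels such as FlashAttention.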

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques