Task-Related Token Compression in Multimodal Large Language Models from an Explainability Perspective

Lei Lei; Jie Gu; Xiaokang Ma; Chu Tang; Jingmin Chen; Tong Xu

arXiv:2506.01097·cs.CV·May 5, 2026

TL;DR

This paper introduces a task-related visual token compression method applied at the input stage of multimodal large language models. Guided by explainability techniques, it reduces computational cost with negligible performance loss.

Contribution

It proposes a model-agnostic, input-stage token compression approach guided by explainability methods, enabling efficient processing in MLLMs without architectural modifications.

Findings

01 Effective token compression at the input stage with negligible performance loss

02 Significant reduction in inference time and memory usage

03 Strong generalization demonstrated across multiple benchmarks and models

Abstract

Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Instruction-related visual token compression demonstrates strong task relevance, which aligns well with MLLMs' ultimate goal of instruction following. Previous works generally assume that visual tokens achieve better vision-language alignment in the shallow layers of LLMs, which has led to task-related token compression being primarily applied in intermediate LLM layers. In contrast, our study reveals that, with proper selection, task-related token compression is feasible at the input stage of the LLM with negligible performance loss. This new paradigm significantly reduces task-irrelevant visual tokens, and its model-agnostic design enables application without modifying the LLM architecture. Specifically, we suggest that explainability…
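To make the idea of input-stage compression concrete, the following is a minimal sketch of selecting visual tokens before they reach the LLM. It assumes a per-token relevance score is already available from some explainability method. The function name `compress_visual_tokens`, the `keep_ratio` parameter, and the top-k selection rule are illustrative assumptions, not the paper's actual procedure, which the truncated abstract does not spell out.

```python
import math
from typing import List, Sequence, Tuple


def compress_visual_tokens(
    visual_tokens: Sequence[Sequence[float]],
    relevance: Sequence[float],
    keep_ratio: float = 0.25,
) -> Tuple[List[List[float]], List[int]]:
    """Keep the highest-relevance visual tokens, preserving their original order.

    Hypothetical helper: `relevance` stands in for scores produced by an
    explainability method; how those scores are computed is not shown here.
    """
    if len(visual_tokens) != len(relevance):
        raise ValueError("one relevance score is required per visual token")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of tokens to keep (assumed rule: fixed ratio, at least one token).
    n_keep = max(1, math.ceil(len(visual_tokens) * keep_ratio))

    # Rank indices by relevance, highest first; ties go to the lower index.
    ranked = sorted(range(len(relevance)), key=lambda i: (-relevance[i], i))

    # Restore the original spatial order of the surviving tokens.
    kept = sorted(ranked[:n_keep])
    return [list(visual_tokens[i]) for i in kept], kept


# Usage: four toy 2-d "tokens", keep the top half by relevance.
tokens = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
kept_tokens, kept_idx = compress_visual_tokens(
    tokens, [0.1, 0.9, 0.3, 0.7], keep_ratio=0.5
)
```

Because the selection happens on the token sequence itself, the shortened sequence can be passed to an unmodified LLM, which is what a model-agnostic design at the input stage implies.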
