TL;DR
This paper introduces a task-related visual token compression method applied at the input stage of multimodal large language models (MLLMs). Guided by explainability techniques, it reduces computational cost with negligible performance loss.
Contribution
It proposes a model-agnostic, input-stage token compression approach guided by explainability methods, enabling efficient processing in MLLMs without architectural modifications.
Findings
Effective token compression at input stage with negligible performance loss
Significant reduction in inference time and memory usage
Strong generalization demonstrated across multiple benchmarks and models
Abstract
Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Instruction-related visual token compression demonstrates strong task relevance, which aligns well with MLLMs' ultimate goal of instruction following. Previous works generally assume that visual tokens achieve better vision-language alignment in the shallow layers of LLMs, which has led to task-related token compression being applied primarily in intermediate LLM layers. In contrast, our study reveals that, with proper selection, task-related token compression is feasible at the input stage of the LLM with negligible performance loss. This new paradigm significantly reduces task-irrelevant visual tokens, and its model-agnostic design enables application without modifying the LLM architecture. Specifically, we suggest that explainability…
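To make the idea of input-stage compression concrete, here is a minimal sketch of the selection step only. It assumes each visual token already has a task-relevance score. How the paper derives those scores from explainability methods is not shown in the excerpt above, so the function name `compress_visual_tokens`, the `relevance` input, and the `keep_ratio` parameter are all hypothetical illustrations rather than the authors' method.

```python
def compress_visual_tokens(tokens, relevance, keep_ratio=0.25):
    """Keep the highest-scoring visual tokens before they reach the LLM.

    tokens:     list of per-token embeddings (any objects), length N
    relevance:  list of N floats; assumed to come from some explainability
                method (hypothetical input, not specified in the excerpt)
    keep_ratio: fraction of tokens to keep (hypothetical parameter)
    """
    if len(tokens) != len(relevance):
        raise ValueError("one relevance score per token is required")
    if not tokens:
        return [], []
    # Always keep at least one token so the LLM still sees some visual input.
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by descending relevance; sorted() is stable,
    # so ties are broken by original position.
    ranked = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)
    # Restore the original sequence order of the surviving tokens.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept], kept


# Toy example: 8 two-dimensional "tokens", keep half of them.
toy_tokens = [[float(i), float(i) + 0.5] for i in range(8)]
toy_scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.0]
kept_tokens, kept_idx = compress_visual_tokens(toy_tokens, toy_scores, keep_ratio=0.5)
print(kept_idx)  # [1, 3, 5, 6]
```

Because the selection happens on the token list before it is passed to the language model, nothing inside the model needs to change, which is the sense in which the abstract calls the design model-agnostic.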