LFTR: Learning-Free Token Reduction for Multimodal Large Language Models
Zihui Zhao, Yingxin Li, Yang Li

TL;DR
LFTR is a learning-free method that reduces visual tokens in multimodal large language models, significantly decreasing computational load without retraining, and improves efficiency in vision question-answering tasks.
Contribution
LFTR introduces a novel, learning-free token reduction technique that seamlessly integrates into existing MLLMs, reducing tokens and computational costs without additional training.
Findings
Achieves up to 16x reduction in visual tokens
Maintains or improves performance on vision question-answering benchmarks
Complementary to other acceleration methods
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. Because the vision modality typically contains more comprehensive information than the text modality, its encoded representations comprise an extensive number of tokens, which leads to significant computational overhead due to the quadratic complexity of the attention mechanism. Current token reduction methods are typically restricted to specific model architectures and often necessitate extensive retraining or fine-tuning, which limits their applicability to many state-of-the-art models. In this paper, we introduce a learning-free token reduction (LFTR) method designed for MLLMs. LFTR can be seamlessly integrated into most open-source MLLM architectures without…
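To make the quadratic-cost argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' method or measurement. It counts query-key pairs in a single self-attention layer before and after a 16x reduction in visual tokens, the upper bound reported in the findings. The token counts (576 visual, 64 text) and the function names are illustrative assumptions, not figures from the paper.

```python
def attention_pairs(num_visual: int, num_text: int) -> int:
    """Query-key pairs in one self-attention layer: quadratic in sequence length."""
    seq_len = num_visual + num_text
    return seq_len * seq_len


def reduced_visual_tokens(num_visual: int, factor: int) -> int:
    """Visual tokens left after an idealized reduction by `factor` (ceiling division)."""
    return -(-num_visual // factor)


# Hypothetical sizes chosen for illustration only; they are not taken from the paper.
NUM_VISUAL = 576
NUM_TEXT = 64
FACTOR = 16  # "up to 16x" reduction, per the findings

kept = reduced_visual_tokens(NUM_VISUAL, FACTOR)
before = attention_pairs(NUM_VISUAL, NUM_TEXT)
after = attention_pairs(kept, NUM_TEXT)

print(f"visual tokens: {NUM_VISUAL} -> {kept}")
print(f"attention pairs per layer: {before} -> {after} ({before / after:.1f}x fewer)")
```

Under these assumed sizes the attention cost falls by more than the 16x token reduction alone, because the cost grows with the square of the total sequence length; text tokens are not reduced, so the saving depends on how much of the sequence is visual.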
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Metallurgy and Material Forming
Methods: Focus
