FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, Enzo Tartaglione

TL;DR
FOLDER is a plug-and-play module that shortens the visual token sequence in multi-modal large language models. By removing up to 70% of visual tokens, it significantly accelerates inference and training while maintaining or improving performance.
Contribution
The paper introduces FOLDER, a novel token reduction method that preserves key information and accelerates multi-modal large language models without sacrificing accuracy.
Findings
FOLDER reduces visual tokens by up to 70%.
Models with FOLDER achieve comparable or better performance.
FOLDER accelerates inference and training processes.
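To make the 70% figure concrete, here is a minimal back-of-the-envelope sketch of how shortening the visual sequence changes the self-attention workload. It is not FOLDER's algorithm. The function names, the token counts (576 visual tokens, 64 text tokens), and the cost model (attention work growing with the square of sequence length, ignoring all other layers) are illustrative assumptions rather than values from the paper.

```python
import math


def reduced_length(num_visual_tokens: int, reduction_ratio: float) -> int:
    """Visual tokens left after removing `reduction_ratio` of them (hypothetical helper)."""
    if not 0.0 <= reduction_ratio < 1.0:
        raise ValueError("reduction_ratio must be in [0, 1)")
    removed = math.floor(num_visual_tokens * reduction_ratio)
    return num_visual_tokens - removed


def attention_cost_ratio(num_visual_tokens: int, num_text_tokens: int,
                         reduction_ratio: float) -> float:
    """Self-attention cost after reduction divided by cost before.

    Assumption: cost is proportional to (sequence length) ** 2, where the
    sequence is the concatenation of visual and text tokens.
    """
    before = num_visual_tokens + num_text_tokens
    after = reduced_length(num_visual_tokens, reduction_ratio) + num_text_tokens
    return (after ** 2) / (before ** 2)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    kept = reduced_length(576, 0.7)
    ratio = attention_cost_ratio(576, 64, 0.7)
    print(f"visual tokens kept: {kept}")
    print(f"relative self-attention cost: {ratio:.3f}")
```

Under these assumptions, removing 70% of 576 visual tokens leaves 173. Because the modeled cost is quadratic in sequence length, the attention workload shrinks by more than the token count does, which is one reason visual token reduction can pay off in both training and inference.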
Abstract
Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness for multi-modal tasks due to their abilities to generate and understand cross-modal data. However, processing long sequences of visual tokens extracted from visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating both computational and memory demands during training and inference. Through a comprehensive analysis of the token reduction process, we analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We showcase the effectiveness of FOLDER by integrating it into the visual backbone of several MLLMs, significantly…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
