Efficient Multi-modal Large Language Models via Visual Token Grouping
Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng

TL;DR
This paper introduces VisToG, a grouping mechanism for multi-modal large language models that reduces inference time by over 27% while maintaining high performance, by efficiently compressing visual tokens using pre-trained vision encoders.
Contribution
VisToG is a novel visual token grouping method that leverages pre-trained vision encoders to reduce computational costs without segmentation masks.
Findings
Maintains 98.1% of original performance.
Reduces inference time by over 27%.
Effectively compresses visual tokens in MLLMs.
Abstract
The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question answering and image captioning. However, the substantial computational costs associated with processing high-resolution images and videos pose a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. While existing methods conduct token reduction in the feature alignment phase, in this paper we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, we concatenate semantic tokens to represent image semantic…
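The grouping idea in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function names, the dot-product similarity, and the mean pooling are assumptions made for illustration, and the Gumbel-based hard assignment is taken only from a reviewer's description of the method. The sketch assigns each patch token to exactly one group token and averages the patches in each group, so N patch tokens shrink to G group tokens.

```python
import math
import random


def gumbel_hard_assign(logits, rng):
    """Return the index of the largest logit after adding Gumbel noise.

    This is the Gumbel-max trick, i.e. the hard (one-hot) forward pass of a
    Gumbel-Softmax; the temperature does not affect the argmax, so it is omitted.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(logit - math.log(-math.log(u)))
    return max(range(len(noisy)), key=noisy.__getitem__)


def group_visual_tokens(patch_tokens, group_tokens, seed=0):
    """Compress N patch tokens into G group tokens (hypothetical sketch).

    Assumptions: similarity is a plain dot product, and each output group
    token is the mean of the patches assigned to it. A group that receives
    no patch keeps its initial value.
    """
    rng = random.Random(seed)
    dim = len(group_tokens[0])
    members = [[] for _ in group_tokens]
    for patch in patch_tokens:
        logits = [sum(p * g for p, g in zip(patch, group)) for group in group_tokens]
        members[gumbel_hard_assign(logits, rng)].append(patch)
    compressed = []
    for init, assigned in zip(group_tokens, members):
        if assigned:
            compressed.append(
                [sum(p[d] for p in assigned) / len(assigned) for d in range(dim)]
            )
        else:
            compressed.append(list(init))
    return compressed


# Four patch tokens are reduced to two group tokens.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = [[1.0, 0.0], [0.0, 1.0]]
compressed = group_visual_tokens(patches, groups)
print(len(patches), "->", len(compressed))
```

The language model then attends over the G group tokens instead of the N patch tokens, which is where the reported inference savings would come from; how VisToG initializes and trains its group tokens is not covered by this sketch.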
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
* Despite reducing computational demands, the method maintains 98.1% of the original performance, indicating that it is highly efficient without compromising on accuracy.
* By reducing the number of visual tokens processed by the model, the method is more scalable and flexible than original MLLMs.
* My main concern is the novelty of the proposed method. The use of clustering algorithms or Q-Formers to reduce the number of vision tokens fed into LLMs has been examined in several previous works, including Chat-UniVi [1]. Additionally, the concept of isolated attention is not novel. I recommend that the authors provide a more in-depth analysis of the proposed method to strengthen the paper.
* To demonstrate the generalizability of the proposed method, I suggest that the authors validate it a
1. The overall design is technically sound and can be easily implemented.
2. This paper offers a clear background on why we need vision token compression in vision-language models.
1. The overall pipeline is highly similar to Q-Former-based vision token compression methods. After going through the paper, I feel the only differences are two particular designs:
- The first one is encouraging the vision encoder to initialize the query tokens (i.e., the group tokens in this paper) by learning some tokens to abstract the vision information in patch tokens of the vision encoder;
- The second one is using a Gumbel-Softmax-based operation to calculate the Q-K similarity, resultin
The method has a certain level of innovation.
More comparative experiments on visual feature compression under the same experimental settings are needed.
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Softmax · Attention Is All You Need
