Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin, Liu

TL;DR
This paper introduces a text-guided dynamic visual token recovery method for multimodal models that maintains performance while significantly reducing visual tokens, enhancing efficiency without retraining.
Contribution
It proposes a novel, training-free token recovery mechanism guided by text similarity, improving token compression in multimodal vision-language models.
Findings
Achieves 90% reduction in visual tokens on average.
Maintains comparable performance to original models.
Demonstrates effectiveness across various visual tasks.
Abstract
With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most of the current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, only relying on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsAdvanced Steganography and Watermarking Techniques · Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques
MethodsPruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
