Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu

arXiv:2409.01179·cs.CV·December 20, 2024

TL;DR

This paper introduces a text-guided dynamic visual token recovery method for multimodal models that maintains performance while significantly reducing visual tokens, enhancing efficiency without retraining.

Contribution

It proposes a novel, training-free token recovery mechanism guided by text similarity, improving token compression in multimodal vision-language models.
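
The idea of text-guided recovery can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `select_tokens`, the fixed budgets `keep_visual` and `recover_text`, the use of the mean visual token as a visual-only importance reference, and the merging of leftover tokens into one averaged token are all assumptions made for this example. The paper describes a dynamic mechanism, whereas this sketch uses fixed counts for simplicity.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale vectors to unit length along the last axis.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def select_tokens(visual, text, keep_visual=4, recover_text=2):
    """Hypothetical sketch of text-guided token recovery.

    visual: (N, D) array of visual tokens.
    text:   (D,) pooled text embedding in the same space (assumed).
    Returns the sorted indices of retained tokens and the compressed
    token sequence (retained tokens plus one merged token, if any remain).
    """
    n = visual.shape[0]
    v = l2_normalize(visual)

    # Step 1 (assumption): visual-only importance, here the cosine
    # similarity of each token to the mean visual token.
    visual_score = v @ l2_normalize(visual.mean(axis=0))
    kept = np.argsort(-visual_score)[:keep_visual]

    # Step 2: among the pruned tokens, recover those most similar to the text.
    text_score = v @ l2_normalize(text)
    pruned = np.setdiff1d(np.arange(n), kept)
    recovered = pruned[np.argsort(-text_score[pruned])[:recover_text]]

    # Step 3 (assumption): merge everything still left into one averaged token.
    selected = np.sort(np.concatenate([kept, recovered]))
    rest = np.setdiff1d(np.arange(n), selected)
    parts = [visual[selected]]
    if rest.size:
        parts.append(visual[rest].mean(axis=0, keepdims=True))
    return selected, np.concatenate(parts, axis=0)
```

Because the selection uses only similarity scores and no learned parameters, a scheme of this shape needs no retraining, which matches the training-free claim above. How the real method chooses its thresholds and merges tokens is not specified in this summary.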

Findings

01. Achieves a 90% reduction in visual tokens on average.

02. Maintains performance comparable to the original models.

03. Demonstrates effectiveness across various visual tasks.

Abstract

With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, relying only on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question,…

Taxonomy

Topics: Advanced Steganography and Watermarking Techniques · Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques

Methods: Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings