Efficient Large Multi-modal Models via Visual Context Compression
Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

TL;DR
This paper introduces a novel method for compressing visual tokens in multi-modal large language models, significantly reducing computational costs while maintaining high performance in image and video understanding tasks.
Contribution
The paper proposes the Visual Context Compressor, which reduces the number of visual tokens in multi-modal models, and LLaVolta, a staged training scheme designed to make that compression efficient without sacrificing performance.
Findings
Removing up to 70% of visual tokens at test time with simple average pooling reduces GQA visual question answering accuracy by only about 3%
Enhanced training efficiency and inference speed in MLLMs
Improved performance in image-language and video-language tasks
Abstract
While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this work, we present the study on the analysis of redundancy concerning visual tokens and efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling only leads to a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance. To minimize information loss caused by the compression on visual tokens while…
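The average-pooling experiment described in the abstract can be illustrated with a minimal sketch. This is not the authors' Visual Context Compressor; it assumes the visual tokens form a flat sequence of equal-length feature vectors and that pooling averages consecutive, non-overlapping windows along that sequence. The function names `average_pool_tokens` and `token_reduction` and the `stride` parameter are hypothetical, chosen only for illustration.

```python
def average_pool_tokens(tokens, stride):
    """Average consecutive groups of `stride` tokens (1D pooling sketch).

    `tokens` is a list of equal-length feature vectors (lists of floats).
    A final group shorter than `stride` is averaged over its own length.
    This is an illustrative assumption, not the paper's implementation.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    pooled = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + stride]
        dim = len(window[0])
        # Element-wise mean over the tokens in this window.
        pooled.append(
            [sum(tok[d] for tok in window) / len(window) for d in range(dim)]
        )
    return pooled


def token_reduction(num_before, num_after):
    """Fraction of tokens removed by compression."""
    return 1.0 - num_after / num_before


if __name__ == "__main__":
    # Ten toy tokens with two features each.
    toy_tokens = [[float(i), float(2 * i)] for i in range(10)]
    compressed = average_pool_tokens(toy_tokens, stride=4)
    print(len(toy_tokens), "->", len(compressed), "tokens")
    print("reduction:", token_reduction(len(toy_tokens), len(compressed)))
```

In this toy setting a stride of 4 turns 10 tokens into 3, which removes 70% of the sequence. Real visual tokens come from an image encoder and may be pooled over a spatial grid rather than a flat sequence, so the sketch only conveys the token-count arithmetic.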
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: Average Pooling
