Efficient Large Multi-modal Models via Visual Context Compression

Jieneng Chen; Luoxin Ye; Ju He; Zhao-Yang Wang; Daniel Khashabi; Alan Yuille

arXiv:2406.20092·cs.CV·November 19, 2024

TL;DR

This paper introduces a method for compressing visual tokens in multi-modal large language models, reducing computational cost while maintaining performance on image and video understanding tasks.

Contribution

The paper proposes the Visual Context Compressor, which reduces the number of visual tokens, and LLaVolta, a staged training scheme. Together they aim to improve training and inference efficiency in multi-modal models without sacrificing performance.

Findings

01 Eliminating up to 70% of visual tokens at test time by average pooling reduces visual question answering accuracy on GQA by only 3%

02 Enhanced training efficiency and inference speed in MLLMs

03 Improved performance in image-language and video-language tasks

Abstract

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this work, we present a study of redundancy in visual tokens and of efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simple average pooling leads to only a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance. To minimize information loss caused by the compression on visual tokens while…
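The average-pooling experiment mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_pool_tokens`, the 1D pooling over the token sequence, the stride of 3, and the 576-token, 8-dimensional input are all assumptions chosen for illustration, and the paper's Visual Context Compressor may pool differently.

```python
import numpy as np


def average_pool_tokens(tokens: np.ndarray, stride: int) -> np.ndarray:
    """Average every `stride` consecutive tokens along the sequence axis.

    tokens: array of shape (num_tokens, hidden_dim).
    A trailing group shorter than `stride` is averaged over its own length.
    (Hypothetical helper, not taken from the paper's code.)
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    num_tokens = tokens.shape[0]
    # Start index of each pooling window: 0, stride, 2 * stride, ...
    starts = np.arange(0, num_tokens, stride)
    # Sum the tokens inside each window [starts[i], starts[i + 1]).
    sums = np.add.reduceat(tokens, starts, axis=0)
    # Window lengths; only the last window can be shorter than `stride`.
    lengths = np.diff(np.append(starts, num_tokens)).reshape(-1, 1)
    return sums / lengths


# Assumed input: 576 visual tokens with a toy hidden size of 8.
rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((576, 8))

pooled = average_pool_tokens(visual_tokens, stride=3)
reduction = 1 - pooled.shape[0] / visual_tokens.shape[0]
print(pooled.shape, f"{reduction:.1%}")  # (192, 8) 66.7%
```

With a stride of 3, two thirds of the tokens are removed, which is close to the 70% reduction discussed in the abstract. Whether accuracy holds up at that rate depends on the model and benchmark; the sketch only shows the token arithmetic.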

Code & Models

Repositories

beckschen/llavolta (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Average Pooling