LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information

Ke Wang; Hong Xuan

arXiv:2412.08771·cs.CV·December 13, 2024

TL;DR

This paper introduces Dynamic Feature Map Reduction (DFMR), a dynamic visual token compression method for LLaVA-1.5. By reducing the visual token load, it makes multi-image and video processing more practical in resource-limited settings.

Contribution

The paper proposes DFMR, a dynamic compression technique that adapts the number of visual tokens in LLaVA-1.5, enabling efficient multi-image and video handling without increased computational costs.

Findings

01. DFMR improves LLaVA-1.5 performance across varied visual token lengths.

02. The method enables multi-image and video processing in resource-constrained environments.

03. DFMR can be used for data augmentation in industry applications.

Abstract

Multi-modal large language models (MLLMs) utilizing instruction-following data, such as LLaVA, have achieved great progress in the industry. A major limitation in these models is that visual tokens consume a substantial portion of the maximum token limit in large language models (LLMs), leading to increased computational demands and decreased performance when prompts include multiple images or videos. Industry solutions often mitigate this issue by increasing computational power, but this approach is less feasible in academic environments with limited resources. In this study, we propose Dynamic Feature Map Reduction (DFMR) based on LLaVA-1.5 to address the challenge of visual token overload. DFMR dynamically compresses the visual tokens, freeing up token capacity. Our experimental results demonstrate that integrating DFMR into LLaVA-1.5 significantly improves the performance of LLaVA…
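
The abstract does not spell out how DFMR chooses its compression level, so the following is only a minimal sketch of the general idea of dynamic feature-map reduction, not the authors' implementation. It assumes the visual encoder yields a 24 x 24 grid of patch features (576 tokens, as in LLaVA-1.5 at 336 px input), that compression is done by average pooling, and that the pooling factor is picked from the variability inside each pooling window. The function names, the candidate factors, and the threshold rule are all hypothetical.

```python
import numpy as np


def avg_pool_tokens(feature_map: np.ndarray, s: int) -> np.ndarray:
    """Average-pool an (H, W, C) feature map with an s x s window.

    Returns an array of shape ((H // s) * (W // s), C): one token per window.
    """
    h, w, c = feature_map.shape
    if h % s or w % s:
        raise ValueError("pooling factor must divide both grid dimensions")
    pooled = feature_map.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))
    return pooled.reshape(-1, c)


def choose_pool_factor(feature_map: np.ndarray, threshold: float,
                       factors=(1, 2, 3)) -> int:
    """Hypothetical selection rule (an assumption, not taken from the paper).

    Picks the largest candidate factor whose mean within-window standard
    deviation does not exceed `threshold`. Factor 1 always qualifies for a
    non-negative threshold, because a 1 x 1 window has zero spread.
    """
    h, w, c = feature_map.shape
    best = 1
    for s in factors:
        if h % s or w % s:
            continue  # skip factors that do not tile the grid evenly
        windows = feature_map.reshape(h // s, s, w // s, s, c)
        spread = windows.std(axis=(1, 3)).mean()
        if spread <= threshold:
            best = max(best, s)
    return best


def compress(feature_map: np.ndarray, threshold: float) -> np.ndarray:
    """Choose a pooling factor for this image, then pool to fewer tokens."""
    return avg_pool_tokens(feature_map, choose_pool_factor(feature_map, threshold))
```

Under these assumptions, a low-detail image is pooled aggressively (24 x 24 down to 8 x 8, so 576 tokens become 64), while a high-detail image keeps its full token count. The freed token budget is what would allow more images or video frames to fit in one prompt.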

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Advanced Vision and Imaging · Video Analysis and Summarization