VoCo-LLaMA: Towards Vision Compression with Large Language Models

Xubing Ye; Yukang Gan; Xiaoke Huang; Yixiao Ge; Yansong Tang

arXiv:2406.12275 · cs.CV · March 4, 2025

Open Access · 1 Repo · 2 Datasets

TL;DR

VoCo-LLaMA introduces a novel vision token compression method using large language models, significantly reducing computational costs while maintaining performance, and effectively understanding temporal correlations in videos.

Contribution

VoCo-LLaMA is the first approach to compress vision tokens with the LLM itself, using attention distillation to improve efficiency and temporal understanding in multi-modal tasks.

Findings

01. Achieves a 576× compression ratio with minimal performance loss.

02. Reduces inference FLOPs by up to 94.8%.

03. Outperforms previous methods on video question-answering benchmarks.

Abstract

Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and the high computational cost of processing high-resolution images and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed tokens, leading to visual information loss. Moreover, how LLMs themselves understand vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression (VoCo) tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA…
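One way to picture the attention distillation described in the abstract is as an attention mask in which text tokens cannot see the raw vision tokens and can only reach visual information through the VoCo tokens. The sketch below is a hypothetical illustration of that idea, not the authors' implementation: the function name `voco_attention_mask`, the token layout (vision, then VoCo, then text), and the exact masking rule are all assumptions made for the example. The token counts (576 vision tokens, 1 VoCo token) are chosen only to match the 576× ratio quoted above.

```python
import numpy as np


def voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> np.ndarray:
    """Return a boolean mask where mask[q, k] is True if query q may attend to key k.

    Assumed sequence layout (hypothetical): [vision | VoCo | text].
    """
    n = n_vision + n_voco + n_text
    # Standard causal mask: each position sees itself and all earlier positions.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Assumption: text queries are blocked from raw vision keys, so visual
    # information can reach the text only via the VoCo tokens.
    text_start = n_vision + n_voco
    mask[text_start:, :n_vision] = False
    return mask


# Illustrative sizes only: 576 vision tokens compressed into 1 VoCo token.
mask = voco_attention_mask(n_vision=576, n_voco=1, n_text=4)
first_text = 576 + 1
keys_visible_to_first_text = int(mask[first_text].sum())
```

Under these assumptions, the VoCo token still attends to all 576 vision tokens, while the first text token sees only the VoCo token and itself. At inference time, such a layout would let the vision tokens be dropped once the VoCo token's cached state is computed, which is where the reduction in token count would come from.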

Code & Models

Repositories

Yxxxb/VoCo-LLaMA (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need