AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity

Zhibin Lan; Liqiang Niu; Fandong Meng; Wenbo Li; Jie Zhou; Jinsong Su

arXiv:2410.02745 · cs.CV · August 7, 2025

TL;DR

AVG-LLaVA adds an adaptive visual granularity mechanism to a large multimodal model. It dynamically selects the level of image detail suited to each input, which improves efficiency and performance by reducing the number of visual tokens and increasing inference speed.

Contribution

The paper proposes a visual granularity router and a training paradigm, RGLF, that together enable adaptive image processing without manual annotations and improve the efficiency of multimodal models.

Findings

01 Achieves an 85.3% reduction in visual tokens

02 Speeds up inference by 2.53×

03 Achieves superior performance across 11 benchmarks
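
To make the token-reduction figure concrete, the following is a minimal arithmetic sketch. The helper name `remaining_tokens` and the baseline count of 2880 visual tokens are assumptions chosen for illustration; the page does not state the baseline token count.

```python
def remaining_tokens(baseline_tokens: int, reduction: float) -> int:
    """Return the number of tokens left after removing a fraction `reduction`."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    return round(baseline_tokens * (1.0 - reduction))


# Hypothetical baseline, NOT taken from the source page.
BASELINE_TOKENS = 2880

# An 85.3% reduction keeps 14.7% of the baseline tokens.
kept = remaining_tokens(BASELINE_TOKENS, 0.853)
print(f"{kept} of {BASELINE_TOKENS} visual tokens remain")
```

Under this assumed baseline, roughly one token in seven is kept. The reported 2.53× speedup is a separate measurement and is not derived from the token count here.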

Abstract

Recently, large multimodal models (LMMs) have achieved significant advancements. When dealing with high-resolution images, dominant LMMs typically divide them into multiple local images and a global image, leading to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. Specifically, we first apply multiple pooling layers to obtain visual tokens at different granularities. Then we propose a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we put forward RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional…
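
The two steps the abstract describes, pooling visual tokens to several granularities and then choosing one of them, can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the function names, the use of average pooling, the stride values, and the voting rule (averaging per-token softmax distributions) are not taken from the paper, and the router's Transformer and MLP layers are omitted, with their output replaced by hypothetical per-token logits.

```python
import math
from typing import Dict, List, Tuple

Grid = List[List[List[float]]]  # height x width x feature_dim


def avg_pool(grid: Grid, k: int) -> Grid:
    """Average-pool a token grid with a k x k window and stride k."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    if h % k or w % k:
        raise ValueError("grid size must be divisible by the stride")
    pooled = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            cell = [0.0] * d
            for di in range(k):
                for dj in range(k):
                    for c in range(d):
                        cell[c] += grid[i + di][j + dj][c]
            row.append([v / (k * k) for v in cell])
        pooled.append(row)
    return pooled


def multi_granularity(grid: Grid, strides: Tuple[int, ...]) -> Dict[int, Grid]:
    """Build one pooled grid per stride; stride 1 keeps full detail."""
    return {k: avg_pool(grid, k) for k in strides}


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vote(per_token_logits: List[List[float]]) -> Tuple[int, List[float]]:
    """Stand-in voter: average each token's distribution, then take the argmax."""
    n, g = len(per_token_logits), len(per_token_logits[0])
    probs = [softmax(row) for row in per_token_logits]
    mean = [sum(p[i] for p in probs) / n for i in range(g)]
    return max(range(g), key=mean.__getitem__), mean


# Toy 4 x 4 grid of 1-dimensional "tokens" with values 0..15.
grid = [[[float(4 * i + j)] for j in range(4)] for i in range(4)]
pooled = multi_granularity(grid, strides=(1, 2, 4))
token_counts = {k: len(g) * len(g[0]) for k, g in pooled.items()}

# Hypothetical logits from three tokens over the three granularities.
choice, mean_probs = vote([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [3.0, 0.0, 0.0]])
print(token_counts, "selected granularity index:", choice)
```

In this toy setup, coarser strides shrink the token count quadratically (16, 4, and 1 tokens), which is the source of the savings when a coarse granularity is sufficient for the instruction.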

Code & Models

Repositories

deeplearnxmu/avg-llava (Official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Vision and Imaging · Video Analysis and Summarization

Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Attention Is All You Need · Dropout · Byte Pair Encoding · Absolute Position Encodings