LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

Shichu Sun; Yichen Zhang; Haolin Song; Zonghao Guo; Chi Chen; Yidan Zhang; Yuan Yao; Zhiyuan Liu; Maosong Sun

arXiv:2511.21150·cs.CV·November 27, 2025

TL;DR

LLaVA-UHD v3 introduces a progressive visual compression method that enables efficient native-resolution encoding in multi-modal large language models, balancing performance and computational efficiency.

Contribution

The paper proposes a novel Progressive Visual Compression technique integrated into Vision Transformers, significantly reducing computation while maintaining competitive performance.

Findings

01 Reduces time-to-first-token by 2.4x compared to baseline.

02 Achieves performance comparable to state-of-the-art models like Qwen2-VL.

03 Demonstrates effective visual encoding at native resolution with lower computational cost.

Abstract

Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, (ii) windowed…
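To make the computational overhead of global native-resolution encoding concrete, the following is a back-of-the-envelope sketch, not the paper's implementation. It counts ViT patch tokens for an image at its native size and the number of token pairs that full (global) self-attention compares. The function names, the example image size (1008 x 1344), and the patch sizes (14 and 28) are assumptions chosen for illustration; they do not come from the paper.

```python
import math


def num_patch_tokens(height: int, width: int, patch_size: int) -> int:
    """Count patch tokens for an image encoded at its native size.

    Assumes partial patches at the border are padded to a full patch,
    hence the ceiling division. This is an illustrative assumption.
    """
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)


def global_attention_pairs(num_tokens: int) -> int:
    """Token pairs compared by one full self-attention layer (quadratic)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    # Hypothetical image size and patch sizes, for illustration only.
    height, width = 1008, 1344
    for patch_size in (14, 28):
        tokens = num_patch_tokens(height, width, patch_size)
        pairs = global_attention_pairs(tokens)
        print(f"patch={patch_size}: tokens={tokens}, attention pairs={pairs}")
```

Under these assumptions, doubling the patch size cuts the token count by 4x and the global-attention pair count by 16x, which is why patch-size scaling and token compression matter for native-resolution encoding.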

Code & Models

Models

🤗 Sishxo/LLaVA-UHD-v3
model · 20 dl · ♡ 2

Datasets

ZzzHelloWorld/LLaVA-UHD-v3_Pilot_experiment
dataset · 607 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques