LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Kechen Fang; Yihua Qin; Chongyi Wang; Wenshuo Ma; Tianyu Yu; Yuan Yao

arXiv:2605.08985·cs.CV·May 12, 2026

1 Repo 19 Models

TL;DR

LLaVA-UHD v4 pairs slice-based encoding with intra-ViT early compression, cutting visual-encoding FLOPs in high-resolution MLLMs by 55.8% while maintaining or improving performance.

Contribution

The paper proposes intra-ViT early compression, which reduces visual tokens in shallow ViT layers, and combines it with slice-based encoding to make high-resolution visual encoding in MLLMs more efficient.

Findings

01 Reduces visual-encoding FLOPs by 55.8%.

02 Slice-based encoding outperforms global encoding across benchmarks.

03 Maintains or surpasses baseline performance.

Abstract

Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers…
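The cost argument in the abstract can be illustrated with back-of-the-envelope arithmetic. The sketch below is not the paper's FLOP accounting: it counts only the two token-by-token matrix products of self-attention, ignores projections and MLPs, and uses hypothetical values for token counts, depth, width, compression layer, and compression ratio. It does not attempt to reproduce the reported 55.8% figure.

```python
from typing import List, Tuple


def attention_cost(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for QK^T and AV in one attention layer.

    Both products scale as num_tokens^2 * dim, hence the factor of 2.
    Projections and MLP blocks (linear in num_tokens) are deliberately ignored.
    """
    return 2 * num_tokens * num_tokens * dim


def vit_attention_cost(tokens_per_layer: List[int], dim: int) -> int:
    """Sum the attention cost over layers, given the token count at each layer."""
    return sum(attention_cost(n, dim) for n in tokens_per_layer)


def global_vs_sliced(total_tokens: int, num_slices: int, depth: int, dim: int) -> Tuple[int, int]:
    """Compare one global pass against independent passes over equal slices."""
    global_cost = vit_attention_cost([total_tokens] * depth, dim)
    per_slice = total_tokens // num_slices
    sliced_cost = num_slices * vit_attention_cost([per_slice] * depth, dim)
    return global_cost, sliced_cost


def post_vs_early(tokens: int, depth: int, dim: int, compress_at: int, ratio: int) -> Tuple[int, int]:
    """Compare compressing after the ViT with compressing after `compress_at` layers.

    Post-ViT compression pays full attention cost in every layer; early
    compression pays it only in the first `compress_at` layers.
    """
    post_cost = vit_attention_cost([tokens] * depth, dim)
    early_layers = [tokens] * compress_at + [tokens // ratio] * (depth - compress_at)
    early_cost = vit_attention_cost(early_layers, dim)
    return post_cost, early_cost


if __name__ == "__main__":
    # Hypothetical settings chosen only for illustration.
    g, s = global_vs_sliced(total_tokens=4096, num_slices=4, depth=24, dim=1024)
    print(f"global / sliced attention cost: {g / s:.2f}x")
    p, e = post_vs_early(tokens=1024, depth=24, dim=1024, compress_at=6, ratio=4)
    print(f"early / post attention cost: {e / p:.4f}")
```

Under these assumptions, splitting the tokens into k equal slices divides the quadratic attention term by k, and moving compression into shallow layers removes most of the per-layer quadratic cost that post-ViT compression would still pay.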

Code & Models

Repositories

thumai-lab/LLaVA-UHD-v4 (GitHub)
