Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models

Junbo Niu; Yuanhong Zheng; Ziyang Miao; Hejun Dong; Chunjiang Ge; Hao Liang; Ma Lu; Bohan Zeng; Qiahao Zheng; Conghui He; Wentao Zhang

arXiv:2506.12776 · cs.CV · June 17, 2025

TL;DR

This paper introduces RC-Bench, a benchmark for evaluating vision-language models (VLMs) across varied resolutions and aspect ratios, and NativeRes-LLaVA, a framework that lets models process images at their native resolutions, improving performance on resolution-centric benchmarks.

Contribution

The paper presents a systematic benchmark (RC-Bench) and a training framework (NativeRes-LLaVA) that address resolution challenges in vision-language models, filling gaps in both evaluation and model design.

Findings

01. Native resolution encoding improves VLM performance on resolution-centric benchmarks.

02. RC-Bench effectively evaluates VLMs under diverse visual conditions.

03. NativeRes-LLaVA enables models to process images at their native resolutions.

Abstract

Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an…
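To make the contrast between fixed low-resolution inputs and native-resolution encoding concrete, the following is a minimal sketch of how the number of visual patch tokens changes between the two. It is not the paper's implementation: the patch size of 14 pixels, the fixed 336 x 336 input, and all function names are assumptions chosen for illustration only.

```python
import math

PATCH = 14        # assumed ViT patch size in pixels (hypothetical, not from the paper)
FIXED_SIDE = 336  # assumed fixed square input side (hypothetical, not from the paper)


def native_grid(width: int, height: int, patch: int = PATCH) -> tuple[int, int]:
    """Rows and columns of patches when the image keeps its own size.

    Each side is rounded up to the next multiple of the patch size,
    as if the image were padded rather than rescaled.
    """
    return math.ceil(height / patch), math.ceil(width / patch)


def fixed_grid(side: int = FIXED_SIDE, patch: int = PATCH) -> tuple[int, int]:
    """Rows and columns of patches after resizing every image to side x side."""
    return side // patch, side // patch


def downscale_factors(width: int, height: int, side: int = FIXED_SIDE) -> tuple[float, float]:
    """Per-axis shrink factors (horizontal, vertical) forced by the square resize."""
    return width / side, height / side


if __name__ == "__main__":
    for w, h in [(336, 336), (1920, 1080), (400, 3000)]:
        rows, cols = native_grid(w, h)
        f_rows, f_cols = fixed_grid()
        sx, sy = downscale_factors(w, h)
        print(
            f"{w}x{h}: native {rows}x{cols} = {rows * cols} tokens, "
            f"fixed {f_rows * f_cols} tokens, shrink x{sx:.2f} / y{sy:.2f}"
        )
```

Under these assumed settings, a 1920 x 1080 image yields a 78 x 138 grid (10,764 patch tokens) at native resolution, while the fixed-size path always yields 576 tokens and shrinks the two axes by different factors, which distorts the aspect ratio. The large, variable token count is one reason native-resolution encoding calls for a dedicated framework rather than a drop-in change.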

Taxonomy

Topics: Multimodal Machine Learning Applications