Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Yi-Fan Zhang; Qingsong Wen; Chaoyou Fu; Xue Wang; Zhang Zhang; Liang Wang; Rong Jin

arXiv:2406.08487·cs.CV·June 17, 2024

TL;DR

This paper introduces SliME, a novel high-resolution multimodal model that efficiently balances global and local image processing, leading to improved performance with less data and computational cost.

Contribution

It proposes a new framework with an optimized training strategy and a learnable token selection method, advancing high-resolution multimodal modeling.

Findings

1. Achieves state-of-the-art results on multiple benchmarks.
2. Utilizing fewer, more informative tokens improves performance.
3. Alternating training enhances global and local feature learning.

Abstract

Seeing clearly with high resolution is a foundation of Large Multimodal Models (LMMs), which has been proven to be vital for visual perception and reasoning. Existing works usually employ a straightforward resolution upscaling method, where the image consists of global and local branches, with the latter being the sliced image patches but resized to the same resolution as the former. This means that higher resolution requires more local patches, resulting in exorbitant computational expenses, and meanwhile, the dominance of local image tokens may diminish the global context. In this paper, we dive into the problems and propose a new framework as well as an elaborate optimization strategy. Specifically, we extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. With regard to local…
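The cost argument in the abstract can be made concrete with a little arithmetic. The sketch below is not the paper's implementation: it assumes each view (the global image and every local slice) is resized to a hypothetical 336 x 336 base resolution and encoded into a hypothetical 576 tokens, then counts how many tokens come from the global and local branches as the input resolution grows. The names `BASE_RES`, `TOKENS_PER_VIEW`, `count_local_patches`, and `visual_token_budget` are illustrative only.

```python
import math

# Assumed values for illustration only (not taken from the abstract):
BASE_RES = 336          # side length each view is resized to
TOKENS_PER_VIEW = 576   # e.g. a 24 x 24 grid of patch tokens per view


def count_local_patches(height: int, width: int, base: int = BASE_RES) -> int:
    """Number of base-sized slices needed to tile a height x width image."""
    return math.ceil(height / base) * math.ceil(width / base)


def visual_token_budget(height, width, base=BASE_RES,
                        tokens_per_view=TOKENS_PER_VIEW):
    """Return (global_tokens, local_tokens) for one global view plus local slices."""
    local_patches = count_local_patches(height, width, base)
    return tokens_per_view, local_patches * tokens_per_view


if __name__ == "__main__":
    for side in (336, 672, 1344):
        g, l = visual_token_budget(side, side)
        share = l / (g + l)  # fraction of visual tokens from local slices
        print(f"{side}x{side}: global={g}, local={l}, local share={share:.2f}")
```

Under these assumptions, a 1344 x 1344 input needs sixteen local slices, so local tokens make up 16 of every 17 visual tokens. This illustrates both concerns the abstract raises: the token count grows with resolution, and the global view becomes a small fraction of the visual input.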

Code & Models

Repositories

yfzhang114/slime (PyTorch, official)

Models

Datasets

yifanzhang114/SMR (dataset, 51 downloads)

Taxonomy

Topics: Advanced Computational Techniques and Applications