Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, Rong Jin

TL;DR
This paper introduces SliME, a novel high-resolution multimodal model that efficiently balances global and local image processing, leading to improved performance with less data and computational cost.
Contribution
It proposes a new framework with an optimized training strategy and a learnable token selection method, advancing high-resolution multimodal modeling.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Utilizing fewer, more informative tokens improves performance.
Alternating training enhances global and local feature learning.
Abstract
Seeing clearly with high resolution is a foundation of Large Multimodal Models (LMMs), which has been proven to be vital for visual perception and reasoning. Existing works usually employ a straightforward resolution upscaling method, where the image consists of global and local branches, with the latter being the sliced image patches but resized to the same resolution as the former. This means that higher resolution requires more local patches, resulting in exorbitant computational expenses, and meanwhile, the dominance of local image tokens may diminish the global context. In this paper, we dive into the problems and propose a new framework as well as an elaborate optimization strategy. Specifically, we extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. With regard to local…
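To make the cost argument in the abstract concrete, here is a minimal sketch of how the token count grows when an image is split into a global view plus local patches that are each resized to the same resolution. The function name `count_visual_tokens`, the 336-pixel base resolution, and the 14-pixel encoder patch size are illustrative assumptions, not values taken from the paper.

```python
import math


def count_visual_tokens(width, height, base_res=336, vit_patch=14):
    """Count visual tokens for one global view plus sliced local views.

    Assumptions (illustrative, not from the paper): every view is resized
    to base_res x base_res and the vision encoder emits one token per
    vit_patch x vit_patch cell.
    """
    tokens_per_view = (base_res // vit_patch) ** 2
    # Number of local slices needed to tile the image at the base resolution.
    cols = math.ceil(width / base_res)
    rows = math.ceil(height / base_res)
    local_views = rows * cols
    global_tokens = tokens_per_view
    local_tokens = local_views * tokens_per_view
    return {
        "local_views": local_views,
        "global_tokens": global_tokens,
        "local_tokens": local_tokens,
        # Fraction of all visual tokens that come from local slices.
        "local_share": local_tokens / (local_tokens + global_tokens),
    }


if __name__ == "__main__":
    for side in (336, 672, 1344):
        print(side, count_visual_tokens(side, side))
```

Under these assumptions, doubling the side length quadruples the number of local views, while the global view stays at a fixed token budget. This is the imbalance the abstract describes: local tokens come to dominate the sequence as resolution rises.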
Taxonomy
Topics: Advanced Computational Techniques and Applications
