INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Yiwei Ma; Zhibin Wang; Xiaoshuai Sun; Weihuang Lin; Qiang Zhou; Jiayi Ji; Rongrong Ji

arXiv:2407.16198·cs.CV·July 24, 2024

TL;DR

INF-LLaVA is a novel multimodal large language model that effectively processes high-resolution images by capturing both local details and global context through dual-perspective modules, outperforming existing models.

Contribution

The paper introduces INF-LLaVA, which employs dual-perspective cropping and enhancement modules to improve high-resolution image perception in multimodal models.

Findings

01. Outperforms existing MLLMs on multiple benchmarks
02. Effective high-resolution image processing with dual-perspective modules
03. Validated through extensive ablation studies

Abstract

With advancements in data availability and computing resources, Multimodal Large Language Models (MLLMs) have showcased capabilities across various fields. However, the quadratic complexity of the vision encoder in MLLMs constrains the resolution of input images. Most current approaches mitigate this issue by cropping high-resolution images into smaller sub-images, which are then processed independently by the vision encoder. Despite capturing sufficient local details, these sub-images lack global context and fail to interact with one another. To address this limitation, we propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception. INF-LLaVA incorporates two innovative components. First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective and comprehensive information…
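The resolution constraint described in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a ViT-style encoder with a patch size of 14 and square 336-pixel sub-images (both hypothetical defaults chosen for the example), and it uses the number of token pairs as a rough proxy for self-attention cost. It models only the plain grid cropping that the abstract attributes to most current approaches, not the Dual-perspective Cropping Module.

```python
def num_tokens(side: int, patch: int = 14) -> int:
    """Patch tokens for a square image of `side` pixels (patch size is an assumption)."""
    if side % patch != 0:
        raise ValueError("side must be a multiple of patch")
    return (side // patch) ** 2


def attention_pairs(side: int, patch: int = 14) -> int:
    """Token pairs under full self-attention: a proxy for its quadratic cost."""
    n = num_tokens(side, patch)
    return n * n


def grid_crop_boxes(side: int, tile: int = 336) -> list[tuple[int, int, int, int]]:
    """Non-overlapping (left, top, right, bottom) boxes in row-major order."""
    if side % tile != 0:
        raise ValueError("side must be a multiple of tile")
    return [
        (x, y, x + tile, y + tile)
        for y in range(0, side, tile)
        for x in range(0, side, tile)
    ]


def tiled_attention_pairs(side: int, tile: int = 336, patch: int = 14) -> int:
    """Total token pairs when every sub-image is encoded independently."""
    return len(grid_crop_boxes(side, tile)) * attention_pairs(tile, patch)


if __name__ == "__main__":
    side = 1344  # hypothetical high-resolution input
    print("tokens, full image:", num_tokens(side))
    print("pairs, full image: ", attention_pairs(side))
    print("pairs, 336px tiles:", tiled_attention_pairs(side))
```

Under these assumptions, encoding the whole 1344-pixel image at once costs 16 times as many token pairs as encoding its 16 tiles separately, which is why cropping is attractive. The trade-off is the one the abstract points out: no token in one tile ever attends to a token in another, so global context is lost.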

Code & Models

Repositories

weihuanglin/inf-llava
PyTorch · Official

Taxonomy

Methods: Sparse Evolutionary Training