TL;DR
INF-LLaVA is a multimodal large language model for high-resolution image perception. Its dual-perspective modules capture both local detail and global context, and the paper reports that it outperforms existing MLLMs on multiple benchmarks.
Contribution
The paper introduces INF-LLaVA, which employs dual-perspective cropping and enhancement modules to improve high-resolution image perception in multimodal models.
Findings
Outperforms existing MLLMs on multiple benchmarks
Effective high-resolution image processing with dual-perspective modules
Validated through extensive ablation studies
Abstract
With advancements in data availability and computing resources, Multimodal Large Language Models (MLLMs) have showcased capabilities across various fields. However, the quadratic complexity of the vision encoder in MLLMs constrains the resolution of input images. Most current approaches mitigate this issue by cropping high-resolution images into smaller sub-images, which are then processed independently by the vision encoder. Despite capturing sufficient local details, these sub-images lack global context and fail to interact with one another. To address this limitation, we propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception. INF-LLaVA incorporates two innovative components. First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective and comprehensive information…
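The cropping trade-off described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation. The function names, the 14-pixel patch size, and the 336/672 resolutions are hypothetical. The abstract as quoted here is truncated before it describes the global perspective, so modeling the global crop as strided (interleaved) sampling is an assumption made for illustration.

```python
import numpy as np


def local_crops(image, grid):
    """Split an (H, W, C) image into grid x grid contiguous tiles.

    Each tile keeps fine detail but sees only one region of the image.
    """
    h, w = image.shape[:2]
    th, tw = h // grid, w // grid
    return [
        image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
        for r in range(grid)
        for c in range(grid)
    ]


def global_crops(image, grid):
    """Split by strided sampling (assumed form of a global-perspective crop).

    Each sub-image takes every grid-th pixel, so it spans the whole image
    at reduced density.
    """
    return [image[r::grid, c::grid] for r in range(grid) for c in range(grid)]


def attention_pairs(side, patch=14):
    """Number of token pairs in self-attention for a square side x side input.

    The token count is (side // patch) ** 2 and attention cost grows with
    its square. The patch size is a hypothetical, typical ViT value.
    """
    tokens = (side // patch) ** 2
    return tokens * tokens


# Small demonstration on a synthetic 8 x 8 RGB image.
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
local = local_crops(image, 2)
glob = global_crops(image, 2)
print(len(local), local[0].shape, glob[0].shape)

# Doubling the side length multiplies the attention cost by 16, while
# four independent crops at the smaller size cost only 4x one crop.
print(attention_pairs(672) // attention_pairs(336))
```

Both crop functions return sub-images of the same shape, so either set can be fed to the same vision encoder. The difference is which pixels each sub-image contains: neighbors within one region for the local crop, or samples spread across the full image for the strided crop.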
Taxonomy
Methods: Sparse Evolutionary Training
