FILA: Fine-Grained Vision Language Models

Shiding Zhu; Wenhui Dong; Jun Song; Yingbo Wang; Yanan Guo; Bo Zheng

arXiv:2412.08378·cs.CV·May 1, 2025

Open Access

TL;DR

The paper introduces HyViLM, a multimodal large language model that processes high-resolution images while retaining the overall image context, using a hybrid visual encoder and a feature fusion strategy. It outperforms state-of-the-art models on 9 of 10 tasks.

Contribution

The paper presents HyViLM, a multimodal large language model built on a new visual encoder (the Hybrid Encoder) and a feature fusion strategy. It processes high-resolution images while retaining the overall context that sub-image cropping would otherwise break.

Findings

01. HyViLM outperforms state-of-the-art models in 9 out of 10 tasks.

02. Achieves 9.6% improvement on TextVQA.

03. Achieves 6.9% improvement on DocVQA.

Abstract

Recently, there has been growing interest in the capability of multimodal large language models (MLLMs) to process high-resolution images. A common approach currently involves dynamically cropping the original high-resolution image into smaller sub-images, which are then fed into a vision encoder that was pre-trained on lower-resolution images. However, this cropping approach often truncates objects and connected areas in the original image, causing semantic breaks. To address this limitation, we introduce HyViLM, designed to process images of any resolution while retaining the overall context during encoding. Specifically, we: (i) Design a new visual encoder called Hybrid Encoder that not only encodes individual sub-images but also interacts with detailed global visual features, significantly improving the model's ability to encode high-resolution images. (ii) Propose an optimal…
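The cropping limitation described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes a plain fixed-size grid of 336-pixel tiles (an illustrative value, not taken from the paper), and the helper names `crop_boxes` and `tiles_touching` are hypothetical. Real dynamic cropping schemes typically also choose a grid by aspect ratio and resize the image, which is omitted here.

```python
from math import ceil

def crop_boxes(width, height, tile=336):
    """Return (left, top, right, bottom) boxes that tile a width x height image.

    Hypothetical helper: a plain fixed-size grid, with edge tiles clipped
    to the image. The tile size of 336 is an assumed, illustrative value.
    """
    cols = ceil(width / tile)
    rows = ceil(height / tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * tile, r * tile
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes

def tiles_touching(obj, boxes):
    """Return indices of tiles that overlap the object box (left, top, right, bottom)."""
    obj_left, obj_top, obj_right, obj_bottom = obj
    return [
        i for i, (left, top, right, bottom) in enumerate(boxes)
        if left < obj_right and obj_left < right and top < obj_bottom and obj_top < bottom
    ]

boxes = crop_boxes(1008, 672)        # 3 columns x 2 rows of 336-pixel tiles
word = (300, 100, 400, 140)          # an object straddling the seam at x = 336
split_across = tiles_touching(word, boxes)
print(split_across)                  # [0, 1]: the object is cut across two sub-images
```

Because each sub-image is encoded separately, an object that lands on a seam like this is seen only in fragments, which is the semantic break the abstract attributes to cropping.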

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques