CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models

Yeyuan Wang; Dehong Gao; Bin Li; Rujiao Long; Lei Yi; Xiaoyan Cai; Libin Yang; Jinxia Zhang; Shanqing Yu; Qi Xuan

arXiv:2412.16869·cs.CV·December 24, 2024

Open Access · 1 Repo

TL;DR

This paper introduces a two-stage coarse-to-fine (CoF) approach for multi-modal large language models: the model first locates the image region relevant to the answer, then focuses on that region to improve fine-grained visual understanding. The authors report a significant performance gain over baseline models.

Contribution

The paper proposes a novel two-stage CoF method that improves visual grounding and fine-grained comprehension in multi-modal LLMs through prompt engineering and attention adjustment.

Findings

01. Significant performance boost over baseline models

02. Enhanced visual grounding and regional focus

03. Good generalization across tasks

Abstract

The impressive performance of Large Language Models (LLMs) has prompted researchers to develop Multi-modal LLMs (MLLMs), which have shown great potential for various multi-modal tasks. However, current MLLMs often struggle to effectively address fine-grained multi-modal challenges. We argue that this limitation is closely linked to the models' visual grounding capabilities. The restricted spatial awareness and perceptual acuity of visual encoders frequently lead to interference from irrelevant background information in images, causing the models to overlook subtle but crucial details. As a result, achieving fine-grained regional visual comprehension becomes difficult. In this paper, we break down multi-modal understanding into two stages, from Coarse to Fine (CoF). In the first stage, we prompt the MLLM to locate the approximate area of the answer. In the second stage, we further enhance the…
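
The abstract is cut off before it describes the second stage, so the following is only an illustrative sketch of the general coarse-to-fine idea, not the authors' implementation. It assumes the first stage returns a normalized bounding box `(x0, y0, x1, y1)` and that "attention adjustment" can be approximated by adding a bias to the attention logits of image patches overlapping that box before the softmax. The function names, the box format, and the `bias` parameter are all hypothetical.

```python
import math


def region_patch_mask(box, grid_h, grid_w):
    """Flag the patches of a grid_h x grid_w grid that overlap a box.

    `box` is (x0, y0, x1, y1) in normalized [0, 1] image coordinates.
    This box format is an assumption made for illustration.
    Returns a flat, row-major list of 0/1 flags.
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(grid_h):
        for col in range(grid_w):
            # Extent of this patch in normalized image coordinates.
            px0, px1 = col / grid_w, (col + 1) / grid_w
            py0, py1 = row / grid_h, (row + 1) / grid_h
            # Strict inequalities: patches that only touch the box edge are excluded.
            overlaps = px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0
            mask.append(1 if overlaps else 0)
    return mask


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_attention(logits, mask, bias=1.0):
    """Add `bias` to the logits of in-region patches, then normalize.

    A hypothetical stand-in for attention adjustment; the paper's
    actual mechanism is not given in the text above.
    """
    adjusted = [logit + bias * flag for logit, flag in zip(logits, mask)]
    return softmax(adjusted)


# Stage 1 (assumed output): the answer lies in the bottom-right quadrant.
coarse_box = (0.5, 0.5, 1.0, 1.0)
mask = region_patch_mask(coarse_box, grid_h=2, grid_w=2)

# Stage 2 (sketch): shift attention mass toward the located region.
uniform_logits = [0.0, 0.0, 0.0, 0.0]
weights = reweighted_attention(uniform_logits, mask, bias=2.0)
```

With uniform starting logits, the single in-region patch ends up with more attention weight than any patch outside the box, while the weights still sum to one. An additive logit bias is used here because it keeps the result a valid distribution; other choices, such as cropping the image to the box, would be equally plausible readings of the summary.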

Code & Models

Repositories

gavin001201/cof (PyTorch · Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Focus