RegionGPT: Towards Region Understanding Vision Language Model

Qiushan Guo; Shalini De Mello; Hongxu Yin; Wonmin Byeon; Ka Chun Cheung; Yizhou Yu; Ping Luo; Sifei Liu

arXiv:2403.02330 · cs.CV · March 5, 2024 · 1 citation

TL;DR

RegionGPT is a vision-language model that improves detailed region-level understanding and captioning by strengthening the spatial awareness of its visual encoder and by training on region-specific caption data.

Contribution

The paper introduces RegionGPT, a framework that enhances spatial awareness in vision-language models and incorporates region-specific data for improved regional understanding.

Findings

01. Significantly improves performance on region-level tasks
02. Effective region caption data generation pipeline
03. Versatile application across multiple region understanding tasks

Abstract

Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to limited spatial awareness of the vision encoder, and the use of coarse-grained training data that lacks detailed, region-specific captions. To address this, we introduce RegionGPT (short as RGPT), a novel framework designed for complex region-level captioning and understanding. RGPT enhances the spatial awareness of regional representation with simple yet effective modifications to existing visual encoders in VLMs. We further improve performance on tasks requiring a specific output scope by integrating task-guided instruction prompts during both training and inference phases, while maintaining the model's versatility for general-purpose tasks. Additionally, we develop an…

Code & Models

Models

linhuixiao/Awesome-Visual-Grounding (model)

Taxonomy

Topics: Geographic Information Systems Studies · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training