GroundingGPT: Language Enhanced Multi-modal Grounding Model

Zhaowei Li; Qi Xu; Dong Zhang; Hang Song; Yiqing Cai; Qi Qi; Ran Zhou; Junting Pan; Zefeng Li; Van Tu Vu; Zhida Huang; Tao Wang

arXiv:2401.06071·cs.CV·March 6, 2024·1 cites

Open Access · 2 Repos · 1 Model

TL;DR

GroundingGPT is a multi-modal model that enhances fine-grained understanding of local information across modalities, improving tasks like precise localization in images and videos through a new dataset and training pipeline.

Contribution

The paper introduces GroundingGPT, a novel language-enhanced multi-modal grounding model that focuses on local information perception, supported by a diversified multi-granularity dataset.

Findings

01 Effective at localizing specific regions in images.

02 Demonstrates precise moment detection in videos.

03 Outperforms existing models in fine-grained tasks.

Abstract

Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose GroundingGPT, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks…

Code & Models

Repositories

Models

🤗 zwli/GroundingGPT · model · 38 dl · ♡ 4

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems