GroundingGPT: Language Enhanced Multi-modal Grounding Model
Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang

TL;DR
GroundingGPT is a multi-modal model for fine-grained understanding of local information across modalities. Supported by a new dataset and training pipeline, it improves tasks such as precise localization in images and videos.
Contribution
The paper introduces GroundingGPT, a novel language-enhanced multi-modal grounding model that focuses on local information perception, supported by a diversified multi-granularity dataset.
Findings
Effective at localizing specific regions in images.
Demonstrates precise moment detection in videos.
Outperforms existing models in fine-grained tasks.
Abstract
Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose GroundingGPT, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
