GLaMM: Pixel Grounding Large Multimodal Model
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan

TL;DR
GLaMM is a multimodal model that generates natural language responses with dense pixel-wise object grounding, supports flexible interaction through visual and textual prompts, and is evaluated on a new large-scale grounded conversation dataset.
Contribution
Introduces GLaMM, the first model to generate grounded language responses with dense pixel-level object segmentation and flexible multimodal prompts, along with a new benchmark dataset.
Findings
GLaMM effectively grounds objects in natural language responses.
The model performs well on downstream tasks like segmentation and captioning.
The proposed dataset enables large-scale evaluation of grounded conversation generation.
Abstract
Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to…
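To make the idea of a response "intertwined with corresponding object segmentation masks" concrete, the following is a minimal sketch of one way such an output could be represented: a text string plus a list of phrase spans, each paired with a binary mask. The class names (`GroundedPhrase`, `GroundedResponse`), the nested-list mask encoding, and the example sentence are all hypothetical illustrations, not GLaMM's actual output format or API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical mask encoding: rows of 0/1 pixels (a real system would use arrays).
Mask = List[List[int]]


@dataclass
class GroundedPhrase:
    span: Tuple[int, int]  # [start, end) character offsets into the response text
    mask: Mask             # binary segmentation mask for the phrase


@dataclass
class GroundedResponse:
    text: str
    phrases: List[GroundedPhrase] = field(default_factory=list)

    def ground(self, phrase: str, mask: Mask) -> None:
        """Attach a mask to the first occurrence of `phrase` in the text."""
        start = self.text.find(phrase)
        if start < 0:
            raise ValueError(f"phrase not found in response: {phrase!r}")
        self.phrases.append(GroundedPhrase((start, start + len(phrase)), mask))

    def phrase_text(self, p: GroundedPhrase) -> str:
        """Recover the grounded phrase from its character span."""
        return self.text[p.span[0]:p.span[1]]


def mask_area(mask: Mask) -> int:
    """Count the foreground pixels in a binary mask."""
    return sum(sum(row) for row in mask)


# Illustrative data only: a made-up caption with two grounded phrases on a 3x3 image.
response = GroundedResponse("A dog sits on a sofa.")
response.ground("A dog", [[1, 1, 0], [1, 1, 0], [0, 0, 0]])
response.ground("a sofa", [[0, 0, 0], [1, 1, 1], [1, 1, 1]])

for p in response.phrases:
    print(response.phrase_text(p), p.span, mask_area(p.mask))
```

Storing character spans rather than copies of the phrase keeps each mask tied to a specific position in the text, which matters when the same noun appears more than once in a response.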
Code & Models
- 🤗 MBZUAI/GLaMM-RegCap-VG (model) · 7 dl
- 🤗 MBZUAI/GLaMM-RegCap-RefCOCOg (model) · 5 dl · ♡ 1
- 🤗 MBZUAI/GLaMM-GCG (model) · 8 dl · ♡ 1
- 🤗 MBZUAI/GLaMM-FullScope (model) · 360 dl · ♡ 7
- 🤗 MBZUAI/GLaMM-RefSeg (model) · 27 dl · ♡ 1
- 🤗 MBZUAI/GLaMM-GranD-Pretrained (model) · 452 dl · ♡ 4
- 🤗 MBZUAI/GLaMM-FullScope_v0 (model) · 6 dl
- 🤗 linhuixiao/Awesome-Visual-Grounding (model) · ♡ 1
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Speech and dialogue systems
