GLaMM: Pixel Grounding Large Multimodal Model

Hanoona Rasheed; Muhammad Maaz; Sahal Shaji Mullappilly; Abdelrahman Shaker; Salman Khan; Hisham Cholakkal; Rao M. Anwer; Eric Xing; Ming-Hsuan Yang; Fahad S. Khan

arXiv:2311.03356·cs.CV·June 4, 2024·6 cites

Open Access · 1 Repo · 8 Models · 2 Datasets

TL;DR

GLaMM is a novel multimodal model capable of generating natural language responses with dense pixel-wise object grounding, supporting flexible interaction through visual and textual prompts, and is evaluated on a new large-scale grounded conversation dataset.

Contribution

Introduces GLaMM, the first model to generate grounded language responses with dense pixel-level object segmentation and flexible multimodal prompts, along with a new benchmark dataset.

Findings

01. GLaMM effectively grounds objects in natural language responses.

02. The model performs well on downstream tasks like segmentation and captioning.

03. The proposed dataset enables large-scale evaluation of grounded conversation generation.

Abstract

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to…
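To make the idea of a response "intertwined with corresponding object segmentation masks" concrete, the following is a minimal sketch of one way such an output could be represented: a text string plus a list of phrase spans, each paired with a binary mask. All names here (`GroundedPhrase`, `GroundedResponse`, `ground`, `mask_area`) and the nested-list mask format are hypothetical illustrations, not GLaMM's actual data structures or API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical mask format: rows of 0/1 values, one entry per pixel.
Mask = List[List[int]]


@dataclass
class GroundedPhrase:
    start: int  # character offset where the phrase begins in the response text
    end: int    # character offset one past the last character of the phrase
    mask: Mask  # binary segmentation mask for the object the phrase refers to


@dataclass
class GroundedResponse:
    text: str
    phrases: List[GroundedPhrase]

    def phrase_text(self, phrase: GroundedPhrase) -> str:
        """Return the substring of the response that the phrase covers."""
        return self.text[phrase.start:phrase.end]


def mask_area(mask: Mask) -> int:
    """Count the pixels set to 1 in a binary mask."""
    return sum(sum(row) for row in mask)


def ground(text: str, items: List[Tuple[str, Mask]]) -> GroundedResponse:
    """Pair each phrase (located at its first occurrence in text) with its mask."""
    phrases = []
    for phrase, mask in items:
        start = text.find(phrase)
        if start < 0:
            raise ValueError(f"phrase not found in text: {phrase!r}")
        phrases.append(GroundedPhrase(start, start + len(phrase), mask))
    return GroundedResponse(text, phrases)


# Toy 3x3 example: two phrases, each tied to its own mask.
response = ground(
    "A dog sits on the grass.",
    [
        ("A dog", [[0, 1, 1], [0, 1, 1], [0, 0, 0]]),
        ("the grass", [[0, 0, 0], [0, 0, 0], [1, 1, 1]]),
    ],
)

for p in response.phrases:
    print(response.phrase_text(p), "covers", mask_area(p.mask), "pixels")
```

Keeping character offsets rather than copies of the phrase text means each mask stays tied to a specific position in the response, which matters when the same noun appears more than once.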

Code & Models

Repositories

mbzuai-oryx/groundingLMM
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Speech and dialogue systems