GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia; Dongchen Han; Yizeng Han; Xuran Pan; Shiji Song; Gao Huang

arXiv:2312.10103 · cs.CV · March 22, 2024 · 1 citation

TL;DR

This paper introduces GSVA, a novel multimodal segmentation model that effectively handles complex referring expressions involving multiple objects or non-existent targets, improving performance on GRES benchmarks.

Contribution

GSVA reuses the [SEG] token for multiple references and learns a [REJ] token to explicitly reject null targets, advancing GRES capabilities.
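
The token routing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the response format, the function name route_targets, and the action labels "decode_mask" and "empty_mask" are all hypothetical. It only shows the idea that each [SEG] token in the model output would prompt the mask decoder once, while each [REJ] token would mark a target as absent.

```python
import re
from typing import List, Tuple

# Matches the two special tokens named in the summary.
TOKEN_PATTERN = re.compile(r"\[(SEG|REJ)\]")


def route_targets(response: str) -> List[Tuple[int, str]]:
    """Map each special token in a (hypothetical) model response to an action.

    Returns (position, action) pairs in the order the tokens appear:
    "decode_mask" for [SEG], "empty_mask" for [REJ].
    """
    actions = []
    for position, match in enumerate(TOKEN_PATTERN.finditer(response)):
        # Assumption: one token per referenced target, in mention order.
        action = "decode_mask" if match.group(1) == "SEG" else "empty_mask"
        actions.append((position, action))
    return actions


if __name__ == "__main__":
    # Hypothetical response for an expression naming two targets,
    # one of which is not present in the image.
    print(route_targets("the dog: [SEG], the cat: [REJ]"))
```

In this sketch a response with several [SEG] tokens yields several mask requests, and a response containing only [REJ] tokens yields none, which mirrors the multi-target and null-target cases that GRES adds to classic RES.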

Findings

1. GSVA achieves state-of-the-art results on the gRefCOCO benchmark.

2. GSVA handles expressions that refer to multiple targets as well as expressions whose target is absent from the image.

3. GSVA improves generalization across referring segmentation tasks.

Abstract

Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient in understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a…

Code & Models

Repositories

leaplabthu/gsva
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Segment Anything Model