Contrastive Grouping with Transformer for Referring Image Segmentation

Jiajin Tang; Ge Zheng; Cheng Shi; Sibei Yang

arXiv:2309.01017·cs.CV·September 6, 2023

TL;DR

This paper introduces CGFormer, a transformer-based framework for referring image segmentation that explicitly models object-level information through token-based querying and grouping, improving segmentation accuracy.

Contribution

The paper proposes a novel mask classification framework with object-aware token querying, grouping, and contrastive learning for better referring image segmentation.

Findings

01. Outperforms state-of-the-art methods in segmentation accuracy
02. Demonstrates strong generalization capabilities
03. Effectively captures object-level information

Abstract

Referring image segmentation aims to segment the target referent in an image conditioned on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt to align vision and language directly at the pixel level and thus fail to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via a token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves cross-level interaction by jointly updating the query tokens and decoding masks in every two…
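
The alternating "query language, then group pixels" loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`query_language`, `group_pixels`), the residual updates, the soft assignment via a softmax over tokens, and the random stand-in features are all assumptions made for illustration, and the contrastive objective and mask decoder are left out because the abstract is truncated before describing them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def query_language(tokens, words):
    """Hypothetical querying step: each token attends over word features
    (softmax over words) and adds the attended context residually."""
    scale = math.sqrt(len(tokens[0]))
    updated = []
    for token in tokens:
        attn = softmax([dot(token, w) / scale for w in words])
        context = weighted_sum(attn, words)
        updated.append([t + c for t, c in zip(token, context)])
    return updated


def group_pixels(tokens, pixels):
    """Hypothetical grouping step: the softmax runs over tokens for each
    pixel, so tokens compete for pixels; each token then pools the
    features of the pixels assigned to it."""
    scale = math.sqrt(len(tokens[0]))
    assign = [softmax([dot(p, t) / scale for t in tokens]) for p in pixels]
    updated = []
    for n, token in enumerate(tokens):
        column = [row[n] for row in assign]
        mass = sum(column) + 1e-8
        pooled = weighted_sum([c / mass for c in column], pixels)
        updated.append([t + p for t, p in zip(token, pooled)])
    return updated, assign


def random_features(count, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Toy sizes (assumed): 3 query tokens, 5 words, 16 pixels, 8 channels.
rng = random.Random(0)
dim, n_tokens, n_words, n_pixels = 8, 3, 5, 16
tokens = random_features(n_tokens, dim, rng)   # stand-in for learnable queries
words = random_features(n_words, dim, rng)     # stand-in for linguistic features
pixels = random_features(n_pixels, dim, rng)   # stand-in for visual features

for _ in range(2):  # alternate querying and grouping
    tokens = query_language(tokens, words)
    tokens, assign = group_pixels(tokens, pixels)

# Each token's column of the assignment matrix acts as a soft mask over pixels.
masks = [[row[n] for row in assign] for n in range(n_tokens)]
```

Normalizing the assignment over tokens rather than over pixels is what turns attention into grouping in this sketch: every pixel distributes one unit of weight across the tokens, so each token's column can be read as a candidate object mask.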

Code & Models

Repositories

toneyaya/cgformer
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Label Smoothing · Dropout · Absolute Position Encodings · Layer Normalization · Adam