Contrastive Grouping with Transformer for Referring Image Segmentation
Jiajin Tang, Ge Zheng, Cheng Shi, Sibei Yang

TL;DR
This paper introduces CGFormer, a transformer-based framework for referring image segmentation that explicitly models object-level information through token-based querying and grouping, improving segmentation accuracy.
Contribution
The paper proposes a novel mask classification framework with object-aware token querying, grouping, and contrastive learning for better referring image segmentation.
Findings
Outperforms state-of-the-art methods in segmentation accuracy
Demonstrates strong generalization capabilities
Effectively captures object-level information
Abstract
Referring image segmentation aims to segment the target referent in an image conditioned on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt to align vision and language directly at the pixel level and thus fail to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via a token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves cross-level interaction by jointly updating the query tokens and decoding masks in every two…
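The alternating "query language, then group pixels" loop described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the function names (`query_language`, `group_pixels`, `sketch_forward`), the random stand-in for learnable query tokens, the soft assignment via a softmax over tokens, and the mean-pooled sentence feature used to pick the referent token are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_language(tokens, words):
    # Querying step (assumed form): each token attends over the word features.
    d = tokens.shape[-1]
    attn = softmax(tokens @ words.T / np.sqrt(d), axis=-1)   # (N tokens, L words)
    return tokens + attn @ words                              # residual update

def group_pixels(tokens, pixels):
    # Grouping step (assumed form): the softmax runs over the token axis,
    # so every pixel distributes itself across the tokens.
    d = tokens.shape[-1]
    assign = softmax(pixels @ tokens.T / np.sqrt(d), axis=-1)  # (P pixels, N tokens)
    # Normalize per token so each token receives a weighted mean of its pixels.
    weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
    return tokens + weights.T @ pixels, assign

def sketch_forward(pixels, words, num_tokens=4, num_rounds=2, seed=0):
    rng = np.random.default_rng(seed)
    d = pixels.shape[-1]
    # Random values stand in for learnable query tokens (hypothetical init).
    tokens = rng.standard_normal((num_tokens, d)) * 0.1
    for _ in range(num_rounds):
        tokens = query_language(tokens, words)
        tokens, assign = group_pixels(tokens, pixels)
    # Mask classification: choose one token as the referent, then read its mask.
    sentence = words.mean(axis=0)          # assumed sentence-level feature
    referent = int(np.argmax(tokens @ sentence))
    mask = assign.argmax(axis=-1) == referent   # boolean mask over pixels
    return mask, assign, referent

# Toy inputs: a 6x6 feature map flattened to 36 pixels, and 5 word features.
rng = np.random.default_rng(1)
pixels = rng.standard_normal((36, 16))
words = rng.standard_normal((5, 16))
mask, assign, referent = sketch_forward(pixels, words)
print(mask.reshape(6, 6).astype(int), referent)
```

The point of the sketch is the contrast with per-pixel classification: the decision about which object the expression refers to is made once, at the token level, and the mask follows from the pixel-to-token assignment rather than from an independent label per pixel.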
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Label Smoothing · Dropout · Absolute Position Encodings · Layer Normalization · Adam
