VIPA: Visual Informative Part Attention for Referring Image Segmentation

Yubin Cho; Hyunwoo Yu; Kyeongbo Kong; Kyomin Sohn; Bongjoon Hyun; Suk-Ju Kang

arXiv:2602.14788·cs.CV·February 17, 2026

Open Access

TL;DR

VIPA is an attention framework for referring image segmentation that uses informative visual parts, produced by a visual expression generator, to improve fine-grained segmentation. The authors report that it outperforms existing methods on four RIS benchmarks.

Contribution

The paper proposes VIPA, a new framework with a visual expression generator that enhances visual context exploitation for more accurate segmentation.

Findings

01. VIPA outperforms state-of-the-art methods on four RIS benchmarks.

02. The visual expression generator effectively reduces noise and captures semantic visual regions.

03. VIPA improves the alignment of attention with fine-grained image regions.

Abstract

Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by incorporating visual information into the language tokens. To exploit visual contexts more effectively for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which provides structural and semantic information about the visual target to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in the attention mechanism used for referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens…
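
To make the general idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how the VEG retrieves or refines tokens, so this sketch assumes a simple top-k cosine-similarity retrieval against a single sentence embedding, followed by one scaled dot-product attention step in which every visual token attends to the retrieved tokens. All function names, the toy vectors, and the choice of `k` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve_informative_tokens(visual_tokens, sentence_embedding, k):
    """Assumed retrieval rule: keep the k visual tokens most similar
    to the sentence embedding. Returns their indices in ascending order."""
    sims = [cosine(v, sentence_embedding) for v in visual_tokens]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: sims[i], reverse=True)
    return sorted(ranked[:k])

def attend(queries, keys_values):
    """Scaled dot-product attention where keys and values are the same tokens."""
    d = len(keys_values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, kv) / math.sqrt(d) for kv in keys_values])
        outputs.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        )
    return outputs

# Toy data: four 2-D visual tokens and one sentence embedding (hypothetical values).
visual_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
sentence_embedding = [1.0, 0.0]

selected = retrieve_informative_tokens(visual_tokens, sentence_embedding, k=2)
visual_expression = [visual_tokens[i] for i in selected]  # stand-in for a "visual expression"
refined = attend(visual_tokens, visual_expression)
```

In this toy setup the keys and values come from the image itself rather than from projected language tokens, which is one plausible reading of how attending to a visual expression could avoid a cross-modal projection inside the attention step. The refinement stage mentioned in the abstract is omitted because the source text is truncated there.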

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection