Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation

Yubin Cho; Hyunwoo Yu; Suk-ju Kang

arXiv:2408.07539·cs.CV·August 15, 2024

TL;DR

This paper introduces CrossVLT, a vision-language transformer architecture for referring image segmentation in which both the vision and language encoders perform early fusion and cross-modal features are aligned at multiple levels. It outperforms previous state-of-the-art methods on three benchmarks.

Contribution

The paper proposes a stage-divided vision and language transformer with mutual early fusion and multi-level feature alignment for improved referring segmentation.

Findings

01. Outperforms previous state-of-the-art on three benchmarks.

02. Enables mutual cross-modal information exchange at all encoder stages.

03. Effective cross-modal fusion through multi-level feature alignment.

Abstract

Referring segmentation aims to segment the target object described by a natural language expression. The key challenges are understanding complex and ambiguous language expressions and, guided by the expression, determining the relevant regions in an image that contains multiple objects. Recent models have focused on early fusion, injecting language features at intermediate stages of the vision encoder, but these approaches have a limitation: the language features cannot refer to the visual information. To address this issue, this paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders (CrossVLT), which allows both the language and vision encoders to perform early fusion, improving cross-modal context modeling. Unlike previous methods, our method enables the vision and…
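The mutual early fusion described in the abstract, where each modality attends to the other at every encoder stage, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cross_attend`, `mutual_fusion_stage`), the toy feature vectors, and the use of plain scaled dot-product attention without learned projections, multiple heads, or normalization are all assumptions made for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Scaled dot-product attention: each query attends over the context vectors.

    Assumption: no learned query/key/value projections, so the context
    vectors serve as both keys and values.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out

def mutual_fusion_stage(vision, language):
    """One hypothetical encoder stage in which both modalities attend to each other.

    Both directions read the features from before the update, and each
    result is added back through a residual connection.
    """
    v_ctx = cross_attend(vision, language)   # vision queries, language context
    l_ctx = cross_attend(language, vision)   # language queries, vision context
    new_vision = [[a + b for a, b in zip(x, c)] for x, c in zip(vision, v_ctx)]
    new_language = [[a + b for a, b in zip(x, c)] for x, c in zip(language, l_ctx)]
    return new_vision, new_language

# Toy inputs: 3 visual tokens and 2 word tokens, each of dimension 2.
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
language = [[0.5, 0.5], [1.0, -1.0]]

# Apply the exchange at every stage (two stages here, chosen arbitrarily).
for _ in range(2):
    vision, language = mutual_fusion_stage(vision, language)

print(len(vision), len(language))
```

The point of the sketch is the symmetry: a one-directional early-fusion design would keep only the `v_ctx` update, whereas here the language features are also refreshed with visual context at each stage.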

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections