OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling

Linhui Xiao; Xiaoshan Yang; Fang Peng; Yaowei Wang; Changsheng Xu

arXiv:2410.08021 · cs.CV · October 28, 2024 · 2 cites

TL;DR

OneRef introduces a unified one-tower transformer framework with Mask Referring Modeling, enabling direct and efficient visual-language grounding and segmentation, surpassing existing methods in performance.

Contribution

The paper proposes a minimalist, unified one-tower transformer architecture with a novel Mask Referring Modeling paradigm for improved referring tasks.

Findings

01. Achieves state-of-the-art results on grounding tasks

02. Outperforms existing methods in segmentation accuracy

03. Demonstrates the effectiveness of the referring-aware masking strategy

Abstract

Constrained by the separate encoding of vision and language, existing grounding and referring segmentation works rely heavily on bulky Transformer-based fusion en-/decoders and a variety of early-stage interaction technologies. At the same time, current mask visual language modeling (MVLM) fails to capture the nuanced referential relationship between image and text in referring tasks. In this paper, we propose OneRef, a minimalist referring framework built on a modality-shared one-tower transformer that unifies the visual and linguistic feature spaces. To model the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM), which encompasses both referring-aware mask image modeling and referring-aware mask language modeling. Both modules reconstruct not only modality-related content but also cross-modal referring content. Within MRefM, we…
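As a rough illustration of the one-tower idea described in the abstract, the toy sketch below concatenates image-patch tokens and text tokens into one sequence and passes them through the same self-attention layers. It is not the authors' implementation: the function names (`shared_self_attention`, `one_tower_encode`), the identity query/key/value projections, the token dimension, and the random inputs are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def shared_self_attention(tokens):
    """Single-head self-attention over one joint sequence.

    Toy assumption: query, key, and value projections are the identity,
    so the same (trivial) weights serve both modalities.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def one_tower_encode(image_tokens, text_tokens, num_layers=2):
    """Encode both modalities with one shared stack (hypothetical helper)."""
    # One sequence: image tokens first, then text tokens.
    seq = list(image_tokens) + list(text_tokens)
    for _ in range(num_layers):
        attended = shared_self_attention(seq)
        # Residual connection around the attention step.
        seq = [[x + y for x, y in zip(t, a)] for t, a in zip(seq, attended)]
    n_img = len(image_tokens)
    return seq[:n_img], seq[n_img:]


rng = random.Random(0)
DIM = 4  # illustrative token dimension
image_tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]
text_tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]

img_out, txt_out = one_tower_encode(image_tokens, text_tokens)
print(len(img_out), len(txt_out))
```

Because every layer attends over the joint sequence, the image-side outputs already depend on the text tokens, so this sketch needs no separate fusion module of the kind the abstract attributes to earlier two-tower designs.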

Code & Models

Videos

OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Attentive Walk-Aggregating Graph Neural Network