UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo

TL;DR
UniRef++ unifies four reference-based object segmentation tasks (RIS, FSS, RVOS, and VOS) in a single architecture that supports multi-task learning. It reports state-of-the-art results on RIS and RVOS benchmarks and competitive results on FSS and VOS.
Contribution
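The paper proposes UniRef++, a single architecture whose UniFusion module performs multiway fusion with the reference each task specifies (language or annotated masks). One network with shared parameters can therefore be trained jointly on the four tasks and applied to any of them.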
Findings
Achieves state-of-the-art performance on RIS and RVOS benchmarks.
Performs competitively on FSS and VOS with shared parameters.
UniFusion can be incorporated into foundation models such as SAM for efficient finetuning.
Abstract
The reference-based object segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object by utilizing either language or annotated masks as references. Despite significant progress in each respective field, current methods are task-specifically designed and developed in different directions, which hinders the activation of multi-task capabilities for these tasks. In this work, we end the current fragmented situation and propose UniRef++ to unify the four reference-based object segmentation tasks with a single architecture. At the heart of our approach is the proposed UniFusion module which performs multiway-fusion for handling different tasks with respect to their specified references. And a unified Transformer architecture is then adopted…
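The idea of one fusion step serving several tasks can be illustrated with a toy sketch. This is not the authors' UniFusion implementation: it swaps in plain scaled dot-product cross-attention with a residual connection, and the function names (`cross_attention_fuse`, `fuse_for_task`) and feature shapes are hypothetical, chosen only to show how the reference type (language for RIS and RVOS, annotated masks for FSS and VOS) could select the input to a shared fusion routine.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(visual, reference):
    """Fuse each visual vector with the reference vectors.

    visual:    list of N query vectors (each of dimension d)
    reference: list of M key/value vectors (each of dimension d)
    Returns N vectors: query + attention-weighted sum of references.
    Assumption: a stand-in for the paper's fusion, not its actual design.
    """
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in visual:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in reference]
        weights = softmax(scores)
        context = [sum(w * ref[j] for w, ref in zip(weights, reference))
                   for j in range(dim)]
        # Residual connection: keep the original visual feature.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


def fuse_for_task(visual, task, language_feats=None, mask_feats=None):
    """Hypothetical dispatcher: pick the reference by task, reuse one fusion."""
    if task in ("RIS", "RVOS"):
        reference = language_feats   # language-referred tasks
    elif task in ("FSS", "VOS"):
        reference = mask_feats       # mask-referred tasks
    else:
        raise ValueError(f"unknown task: {task}")
    if reference is None:
        raise ValueError(f"task {task} needs its reference features")
    return cross_attention_fuse(visual, reference)


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    language = [[1.0, 0.0]]
    print(fuse_for_task(visual, "RIS", language_feats=language))
```

The point of the sketch is the dispatch: the same fusion routine is called for every task, and only the reference passed to it changes.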
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · VOS · Absolute Position Encodings · Dropout · Layer Normalization · Residual Connection
