Referring Transformer: A One-step Approach to Multi-task Visual Grounding
Muchen Li, Leonid Sigal

TL;DR
This paper introduces a one-stage transformer-based framework for visual grounding that jointly predicts a bounding box and a segmentation mask for the referred region, outperforming previous methods through contextualized multi-task learning.
Contribution
The paper presents a novel single-stage transformer model that integrates multi-task visual grounding, simplifying architecture and improving performance over prior two-stage or complex models.
Findings
Outperforms state-of-the-art on REC and RES tasks
Benefits significantly from contextualized information
Pre-training further enhances accuracy
Abstract
As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require the designing of complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-arts…
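The data flow described in the abstract (joint encoding of both modalities, a language-derived query, then a box head and a mask head sharing that query) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes single-head attention without learned projections, random weights, a mean-pooled query, and a dot-product mask head. All names (`ground`, `w_box`, `D`) are hypothetical.

```python
import math
import random

random.seed(0)
D = 8  # hypothetical embedding width


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    # a is (n x k), b is (k x m); zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention with no learned projections."""
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def ground(visual_tokens, word_tokens, w_box):
    n_vis = len(visual_tokens)
    # Encoder: self-attention over the concatenated tokens fuses the modalities.
    tokens = visual_tokens + word_tokens
    memory = add(tokens, attention(tokens, tokens, tokens))
    # Lingual query: mean of the fused word tokens (an assumption of this sketch).
    fused_words = memory[n_vis:]
    query = [sum(col) / len(fused_words) for col in zip(*fused_words)]
    # Decoder: the query attends to the fused memory (residual connection).
    decoded = add([query], attention([query], memory, memory))[0]
    # Box head: linear layer plus sigmoid gives four values in (0, 1),
    # read here as a normalized (cx, cy, w, h) box.
    box = [1 / (1 + math.exp(-v)) for v in matmul([decoded], w_box)[0]]
    # Mask head: one logit per visual location from a dot product with the query.
    mask_logits = [sum(q * v for q, v in zip(decoded, memory[i])) for i in range(n_vis)]
    return box, mask_logits


visual = rand_matrix(6, D)  # e.g. a flattened 2x3 feature map
words = rand_matrix(4, D)   # four word embeddings
w_box = rand_matrix(D, 4)   # untrained box-regression weights
box, mask_logits = ground(visual, words, w_box)
print(len(box), len(mask_logits))
```

Both outputs come from the same decoded query, which is the sense in which the two tasks share context in this sketch. A real model would add learned projections, multiple heads and layers, normalization, and trained weights.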
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
