Referring Transformer: A One-step Approach to Multi-task Visual Grounding

Muchen Li; Leonid Sigal

arXiv:2106.03089 · cs.CV · July 15, 2021 · 73 cites

TL;DR

This paper introduces a one-stage transformer-based framework for visual grounding that unifies phrase localization and segmentation, outperforming previous methods through contextualized multi-task learning.

Contribution

The paper presents a single-stage transformer model that handles several visual grounding tasks at once, simplifying the architecture and improving performance over prior two-stage pipelines and complex task-specific one-stage models.

Findings

01. Outperforms state-of-the-art on REC and RES tasks

02. Benefits significantly from contextualized information

03. Pre-training further enhances accuracy

Abstract

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require the design of complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-arts…
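The abstract describes two outputs per referred region: a regressed bounding box (REC) and a segmentation mask (RES). The sketch below shows how such outputs are commonly scored with intersection over union (IoU). It is an illustration only, not code from the paper or its repository: the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold for counting a box as correct are assumptions based on common practice, and this page does not state which metrics the authors report.

```python
from typing import List, Sequence

# Assumed box format: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Sequence[float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(pred: List[List[int]], gt: List[List[int]]) -> float:
    """IoU of two binary masks given as equally sized 0/1 grids."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def rec_accuracy(preds: List[Box], gts: List[Box], thresh: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the ground truth exceeds thresh.

    The 0.5 default is an assumed convention, not a value taken from this page.
    """
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) > thresh)
    return hits / len(preds)


if __name__ == "__main__":
    predicted = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(rec_accuracy(predicted, ground_truth))
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```

Because a one-stage multi-task model emits the box and the mask for the same query, both scores can be computed from a single forward pass per expression.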

Code & Models

Repositories

ubc-vision/RefTR
PyTorch · Official

Videos

Referring Transformer: A One-step Approach to Multi-task Visual Grounding · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning