Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Hengcan Shi; Munawar Hayat; Jianfei Cai

arXiv:2201.06686·cs.CV·June 7, 2022

TL;DR

This paper introduces a bidirectional cross-modal matching framework for unpaired referring expression grounding, combining top-down and bottom-up strategies with pretrained models to improve performance without paired annotations.

Contribution

The paper proposes a novel BiCM framework with query-aware attention, cross-modal object matching using CLIP, and knowledge adaptation, advancing unpaired referring grounding methods.

Findings

1. Outperforms previous methods by 6.55% and 9.94% on two datasets.
2. Introduces a query-aware attention map for top-down guidance.
3. Leverages CLIP for bottom-up object matching.
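The bottom-up matching idea in the findings above can be illustrated with a minimal sketch. It is not the paper's COM module: it only shows the general pattern of scoring candidate objects against a query by cosine similarity in a shared embedding space. The toy vectors stand in for embeddings that an image-text model such as CLIP would produce, and the names `cosine` and `match_query_to_objects` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_query_to_objects(query_emb, object_embs):
    """Return (index of best-matching object, list of all similarity scores).

    Assumption: query_emb and every entry of object_embs already live in the
    same embedding space, as a pretrained image-text model would provide.
    """
    if not object_embs:
        raise ValueError("need at least one candidate object embedding")
    scores = [cosine(query_emb, emb) for emb in object_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data: one query embedding and three candidate object embeddings.
query = [1.0, 0.0, 1.0]
proposals = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 1.1],  # nearly parallel to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
best_index, scores = match_query_to_objects(query, proposals)
print(best_index, [round(s, 3) for s in scores])
```

In this toy case the second proposal scores highest because its direction is closest to the query's. A real pipeline would replace the hand-written vectors with embeddings of cropped object proposals and of the query text.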

Abstract

Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, where the training data contains only a number of images and queries without correspondences. The few existing solutions to unpaired referring grounding are still preliminary, because learning image-text matching is difficult and top-down guidance is lacking when the data are unpaired. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. In particular, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps. A cross-modal object matching (COM) module is further introduced, which exploits the recently emerged pretrained image-text matching model,…

Code & Models

Models

🤗 linhuixiao/Awesome-Visual-Grounding (model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training