Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution

Chonghan Chen; Qi Jiang; Chih-Hao Wang; Noel Chen; Haohan Wang; Xiang Li; Bhiksha Raj

arXiv:2206.09114 · cs.CV · June 23, 2022

TL;DR

This paper introduces a Query-conditioned Convolution Module (QCM) that enhances visual grounding by generating query-aware visual features, leading to improved accuracy and state-of-the-art results on multiple datasets.

Contribution

The paper proposes a novel QCM that incorporates query information into convolutional kernels, improving feature discrimination for visual grounding tasks.

Findings

01. Achieves state-of-the-art performance on three datasets.
02. Query-aware features are highly informative for object prediction.
03. The method performs well even without multi-modal fusion.

Abstract

Visual grounding is a task that aims to locate a target object according to a natural language expression. As a multi-modal task, feature interaction between textual and visual inputs is vital. However, previous solutions mainly handle each modality independently before fusing them together, which does not take full advantage of relevant textual information while extracting visual features. To better leverage the textual-visual relationship in visual grounding, we propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels. With our proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the desired object described in the expression, leading to more accurate predictions. Extensive experiments on three popular visual…
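
The idea of generating convolutional kernels from the query can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the 1x1 kernel size, the single linear projection `proj`, and every function name below are assumptions made for illustration only.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def generate_kernel(query: Vector, proj: Matrix, c_out: int, c_in: int) -> Matrix:
    """Map a query embedding to a (c_out x c_in) kernel for a 1x1 convolution.

    `proj` is a hypothetical learned projection with c_out * c_in rows,
    each as long as the query embedding.
    """
    if len(proj) != c_out * c_in:
        raise ValueError("proj must have c_out * c_in rows")
    # Linear projection of the query, then reshape into kernel rows.
    flat = [sum(w * q for w, q in zip(row, query)) for row in proj]
    return [flat[o * c_in:(o + 1) * c_in] for o in range(c_out)]


def conv1x1(features: FeatureMap, kernel: Matrix) -> FeatureMap:
    """Apply a 1x1 convolution: mix input channels at every spatial location."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [sum(k * features[c][y][x] for c, k in enumerate(row))
             for x in range(width)]
            for y in range(height)
        ]
        for row in kernel
    ]


def query_conditioned_conv(features: FeatureMap, query: Vector,
                           proj: Matrix, c_out: int) -> FeatureMap:
    """Convolve visual features with a kernel generated from the query."""
    kernel = generate_kernel(query, proj, c_out, c_in=len(features))
    return conv1x1(features, kernel)


# Toy inputs: a 2-channel, 1x2 feature map and an identity-like projection.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
proj = [[1.0, 0.0], [0.0, 1.0]]

out_a = query_conditioned_conv(features, [1.0, 0.0], proj, c_out=1)
out_b = query_conditioned_conv(features, [0.0, 1.0], proj, c_out=1)
```

In this toy setup the same visual input yields different output features for different queries, because the kernel weights depend on the query embedding. That dependence is the property the abstract calls query-aware; a realistic module would use larger kernels and learned projections.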

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Subtitles and Audiovisual Media

Methods: Convolution