Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution
Chonghan Chen, Qi Jiang, Chih-Hao Wang, Noel Chen, Haohan Wang, Xiang Li, Bhiksha Raj

TL;DR
This paper introduces a Query-conditioned Convolution Module (QCM) that enhances visual grounding by generating query-aware visual features, leading to improved accuracy and state-of-the-art results on multiple datasets.
Contribution
The paper proposes a novel QCM that incorporates query information into convolutional kernels, improving feature discrimination for visual grounding tasks.
Findings
Achieves state-of-the-art performance on three datasets.
Query-aware features are highly informative for object prediction.
Method performs well even without multi-modal fusion.
Abstract
Visual grounding is a task that aims to locate a target object according to a natural language expression. As a multi-modal task, feature interaction between textual and visual inputs is vital. However, previous solutions mainly handle each modality independently before fusing them together, which does not take full advantage of relevant textual information while extracting visual features. To better leverage the textual-visual relationship in visual grounding, we propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels. With our proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the desired object described in the expression, leading to more accurate predictions. Extensive experiments on three popular visual…
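To make the idea of "incorporating query information into the generation of convolutional kernels" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not specify the QCM architecture, so the single linear kernel generator, the tensor shapes, and all names (`generate_kernel`, `conv2d_same`, `query_conditioned_conv`, `w_gen`, `b_gen`) are hypothetical.

```python
import numpy as np

def generate_kernel(query_emb, w_gen, b_gen, c_out, c_in, k):
    # Assumption: one linear layer maps the query embedding (D,) to the
    # flattened kernel weights; the paper's generator may be different.
    flat = w_gen @ query_emb + b_gen
    return flat.reshape(c_out, c_in, k, k)

def conv2d_same(x, kernel):
    # x: (c_in, H, W); kernel: (c_out, c_in, k, k) with odd k.
    # Zero padding, stride 1, cross-correlation as in deep learning libraries.
    c_out, c_in, k, _ = kernel.shape
    pad = k // 2
    _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, w))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + w]  # shifted view, (c_in, H, W)
            out += np.einsum("oc,chw->ohw", kernel[:, :, dy, dx], patch)
    return out

def query_conditioned_conv(x, query_emb, w_gen, b_gen, c_out, k=3):
    # The kernel is produced per query, so the same image yields
    # different visual features for different expressions.
    kernel = generate_kernel(query_emb, w_gen, b_gen, c_out, x.shape[0], k)
    return conv2d_same(x, kernel)

# Toy data: random stand-ins for an image feature map and two query embeddings.
rng = np.random.default_rng(0)
c_in, c_out, k, d = 4, 8, 3, 16
x = rng.standard_normal((c_in, 10, 10))
w_gen = rng.standard_normal((c_out * c_in * k * k, d)) * 0.1
b_gen = np.zeros(c_out * c_in * k * k)
q1 = rng.standard_normal(d)
q2 = rng.standard_normal(d)

f1 = query_conditioned_conv(x, q1, w_gen, b_gen, c_out, k)
f2 = query_conditioned_conv(x, q2, w_gen, b_gen, c_out, k)
print(f1.shape, np.allclose(f1, f2))
```

The point of the sketch is the contrast with an ordinary convolution, whose weights are fixed after training: here the weights are a function of the query, so the text can shape the visual features before any fusion step takes place.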
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Subtitles and Audiovisual Media
Methods: Convolution
