VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching

Chenchi Zhang; Wenbo Ma; Jun Xiao; Hanwang Zhang; Jian Shao; Yueting Zhuang; Long Chen

arXiv:2105.05636 · cs.CV · January 6, 2023

TL;DR

VL-NMS introduces a query-aware proposal filtering method for two-stage visual-language matching, significantly improving recall and performance by aligning proposals with critical objects mentioned in text queries.

Contribution

This paper presents VL-NMS, the first approach to generate query-aware proposals in the initial detection stage, enhancing two-stage multimodal matching methods.

Findings

01

VL-NMS improves matching recall across benchmarks.

02

VL-NMS enhances performance in referring expression grounding.

03

VL-NMS is compatible with existing two-stage methods.

Abstract

The prevailing framework for matching multimodal inputs is based on a two-stage process: 1) detecting proposals with an object detector and 2) matching text queries with proposals. Existing two-stage solutions mostly focus on the matching step. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., query-agnostic), hoping that the proposals contain all instances mentioned in the text query (i.e., query-aware). Due to this mismatch, chances are that proposals relevant to the text query are suppressed during the filtering process, which in turn bounds the matching performance. To this end, we propose VL-NMS, which is the first method to yield query-aware proposals at the first stage. VL-NMS regards all mentioned instances as critical objects,…
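To make the mismatch concrete, here is a minimal sketch of greedy non-maximum suppression run twice on the same toy boxes: once ranked by detection confidence alone (query-agnostic) and once ranked by confidence multiplied by a per-box query relevance score. The relevance values, the product rule, and all names (`greedy_nms`, `relevance`, the toy boxes) are illustrative assumptions; the abstract does not specify how VL-NMS scores proposals, so this should not be read as the paper's formulation.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,
    top_k: Optional[int] = None,
) -> List[int]:
    """Standard greedy NMS: visit boxes by descending score and keep a box
    only if it does not overlap an already kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if top_k is not None and len(keep) == top_k:
                break
    return keep


# Toy proposals: boxes 0 and 1 overlap heavily (IoU 0.81); box 2 is far away.
boxes: List[Box] = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
confidence = [0.9, 0.6, 0.5]  # detector confidence per box

# ASSUMPTION: hypothetical relevance of each box to the text query.
# Box 1 is taken to be the object the query mentions.
relevance = [0.2, 0.9, 0.1]

# Query-agnostic filtering ranks by confidence only: box 1 is suppressed by box 0.
agnostic_keep = greedy_nms(boxes, confidence)

# ASSUMPTION: query-aware filtering ranks by confidence * relevance.
aware_scores = [c * r for c, r in zip(confidence, relevance)]
aware_keep = greedy_nms(boxes, aware_scores)

print("query-agnostic keeps:", agnostic_keep)
print("query-aware keeps:   ", aware_keep)
```

With confidence-only ranking, the query-relevant box 1 never reaches the matching stage because the higher-confidence box 0 suppresses it. Re-ranking with a relevance term lets box 1 survive instead, which is the kind of recall gain the abstract describes.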

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning