VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching
Chenchi Zhang, Wenbo Ma, Jun Xiao, Hanwang Zhang, Jian Shao, Yueting, Zhuang, Long Chen

TL;DR
VL-NMS introduces a query-aware proposal filtering method for two-stage visual-language matching, significantly improving recall and performance by aligning proposals with critical objects mentioned in text queries.
Contribution
This paper presents VL-NMS, the first approach to generate query-aware proposals in the initial detection stage, enhancing two-stage multimodal matching methods.
Findings
VL-NMS improves matching recall across benchmarks.
VL-NMS enhances performance in referring expression grounding.
VL-NMS is compatible with existing two-stage methods.
Abstract
The prevailing framework for matching multimodal inputs is based on a two-stage process: 1) detecting proposals with an object detector and 2) matching text queries with proposals. Existing two-stage solutions mostly focus on the matching step. In this paper, we argue that these methods overlook an obvious \emph{mismatch} between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., query-agnostic), hoping that the proposals contain all instances mentioned in the text query (i.e., query-aware). Due to this mismatch, chances are that proposals relevant to the text query are suppressed during the filtering process, which in turn bounds the matching performance. To this end, we propose VL-NMS, which is the first method to yield query-aware proposals at the first stage. VL-NMS regards all mentioned instances as critical objects,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
