Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding

Yang Jiao; Zequn Jie; Jingjing Chen; Lin Ma; Yu-Gang Jiang

arXiv:2203.05186 · cs.CV · August 22, 2023

Open Access

TL;DR

This paper introduces a novel approach for one-stage visual grounding that focuses on modeling relationships among suspected objects, improving accuracy and efficiency by dynamically identifying and re-evaluating confusing objects during training.

Contribution

The paper proposes the Suspected Object Transformation mechanism (SOT), along with Keyword-Aware Discrimination and Exploration strategies, to enhance one-stage visual grounders by better handling ambiguous objects.

Findings

01. Significant accuracy improvements on benchmark datasets.
02. Enhanced model ability to distinguish target objects among confusing candidates.
03. Effective integration with existing CNN and Transformer-based models.

Abstract

Recently, one-stage visual grounders have attracted considerable attention because they achieve accuracy comparable to two-stage grounders at significantly higher efficiency. However, inter-object relationship modeling has not been well studied for one-stage grounders. Such modeling, though important, need not be performed among all objects, since only some of them are related to the text query and may confuse the model. We call these suspected objects. Exploring their relationships in the one-stage paradigm is non-trivial for two reasons. First, no object proposals are available as a basis for selecting suspected objects and modeling their relationships. Second, suspected objects are more confusing than others, as they may share similar semantics, be entangled in certain relationships, etc., and can therefore more easily mislead the model's prediction. Toward this end, we…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition