Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding
Yang Jiao, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang

TL;DR
This paper introduces a novel approach for one-stage visual grounding that focuses on modeling relationships among suspected objects, improving accuracy and efficiency by dynamically identifying and re-evaluating confusing objects during training.
Contribution
The paper proposes the Suspected Object Transformation mechanism (SOT), along with Keyword-Aware Discrimination and Exploration strategies, to enhance one-stage visual grounders by better handling ambiguous objects.
Findings
Significant accuracy improvements on benchmark datasets.
Enhanced model ability to distinguish target objects among confusing candidates.
Effective integration with existing CNN and Transformer-based models.
Abstract
Recently, one-stage visual grounders have attracted considerable attention because they offer accuracy comparable to two-stage grounders at significantly higher efficiency. However, inter-object relation modeling has not been well studied for one-stage grounders. Inter-object relationship modeling, though important, need not be performed among all objects, as only some of them are related to the text query and may confuse the model. We call these objects suspected objects. Exploring their relationships in the one-stage paradigm is non-trivial for two reasons. First, no object proposals are available as the basis on which to select suspected objects and perform relationship modeling. Second, suspected objects are more confusing than others, as they may share similar semantics, be entangled with certain relationships, etc., and thereby more easily mislead the model's prediction. Toward this end, we…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
