Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, Gao Huang

TL;DR
Pseudo-Q is a novel approach that automatically generates pseudo language queries from unlabeled images to train visual grounding models, significantly reducing annotation costs while maintaining high performance.
Contribution
The paper introduces Pseudo-Q, a method that generates pseudo language queries for objects detected in unlabeled images, so that visual grounding models can be trained with supervision without manually annotated queries.
Findings
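Reduces human annotation costs by 31% on RefCOCO without degrading the original model's performance under the fully supervised setting.
Achieves superior or comparable performance to state-of-the-art weakly-supervised visual grounding methods.
Effective on all five datasets tested.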
Abstract
Visual grounding, i.e., localizing objects in images according to natural language queries, is an important topic in visual language understanding. The most effective approaches for this task are based on deep learning, which generally require expensive manually labeled image-query or patch-query pairs. To eliminate the heavy dependence on human annotations, we present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training. Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images, and then language queries for these objects are obtained in an unsupervised fashion with a pseudo-query generation module. Then, we design a task-related query prompt module to specifically tailor generated pseudo language queries for visual grounding tasks. Further, in order to fully capture the contextual…
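The pipeline described in the abstract (detect objects, generate a query per object, wrap the query in a task prompt) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Detection` type, the left/right rule based on box centers, and the `PROMPT` template are hypothetical stand-ins chosen only to show how detector outputs could be turned into (query, box) training pairs without human annotation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    """One output of an off-the-shelf detector (hypothetical container)."""
    label: str
    box: Box


def center_x(det: Detection) -> float:
    """Horizontal center of a detection's box."""
    x1, _, x2, _ = det.box
    return (x1 + x2) / 2.0


def pseudo_queries(dets: List[Detection]) -> List[Tuple[str, Box]]:
    """Pair detected boxes with pseudo queries built from their labels.

    Assumption: when several objects share a label, only the leftmost and
    rightmost are kept and tagged "left"/"right"; a unique object is
    described by its label alone.
    """
    by_label: Dict[str, List[Detection]] = {}
    for det in dets:
        by_label.setdefault(det.label, []).append(det)

    pairs: List[Tuple[str, Box]] = []
    for label, group in by_label.items():
        if len(group) == 1:
            pairs.append((label, group[0].box))
            continue
        ordered = sorted(group, key=center_x)
        pairs.append((f"left {label}", ordered[0].box))
        pairs.append((f"right {label}", ordered[-1].box))
    return pairs


# Hypothetical prompt template; the paper's exact wording is not given here.
PROMPT = "find the region that corresponds to the description: {query}"


def with_prompt(query: str) -> str:
    """Wrap a pseudo query in the task-related prompt."""
    return PROMPT.format(query=query)


if __name__ == "__main__":
    detections = [
        Detection("dog", (0, 0, 10, 10)),
        Detection("dog", (50, 0, 60, 10)),
        Detection("cat", (20, 0, 30, 10)),
    ]
    for query, box in pseudo_queries(detections):
        print(with_prompt(query), box)
```

Each resulting (prompted query, box) pair plays the role that a human-written referring expression and its annotated box would play in fully supervised training.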
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
