Position-guided Text Prompt for Vision-Language Pre-training
Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan

TL;DR
This paper introduces a position-guided text prompt (PTP) method to improve the visual grounding ability of vision-language pre-training models, leading to better performance on cross-modal tasks without relying on object detectors during inference.
Contribution
The novel PTP paradigm reformulates visual grounding as a fill-in-the-blank task, significantly enhancing VLP models' grounding capabilities and efficiency.
Findings
Improves zero-shot Flickr30K retrieval (+4.8 recall@1)
Enhances COCO captioning (+5.3 CIDEr)
Achieves comparable results to object-detector based methods with faster inference
Abstract
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling "P" or "O" in a PTP "The block P has a…
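The block-and-prompt idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 3x3 grid, the rule of assigning an object to the block containing its box center, the `[MASK]` token, the full wording of the template beyond the truncated excerpt above, and the names `Detection`, `block_index`, and `masked_prompts` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    label: str  # object name, e.g. produced by an off-the-shelf detector
    box: Box


def block_index(box: Box, width: float, height: float, grid: int = 3) -> int:
    """Row-major index of the grid block that contains the box center."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    col = min(int(cx * grid / width), grid - 1)   # clamp centers on the right edge
    row = min(int(cy * grid / height), grid - 1)  # clamp centers on the bottom edge
    return row * grid + col


def masked_prompts(det: Detection, width: float, height: float,
                   grid: int = 3, mask: str = "[MASK]") -> List[Tuple[str, str]]:
    """Return (prompt, answer) pairs: one hides the block, one hides the object."""
    p = str(block_index(det.box, width, height, grid))
    template = "The block {P} has a {O}."  # assumed wording, for illustration only
    return [
        (template.format(P=mask, O=det.label), p),   # model must predict the block
        (template.format(P=p, O=mask), det.label),   # model must predict the object
    ]


if __name__ == "__main__":
    dog = Detection("dog", (210, 210, 300, 300))
    for prompt, answer in masked_prompts(dog, width=300, height=300):
        print(prompt, "->", answer)
```

In this sketch the detector is only needed to build the training prompts, which is consistent with the summary's claim that no object detector is required at inference time.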
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · BLIP: Bootstrapping Language-Image Pre-training · ALIGN
