Position-guided Text Prompt for Vision-Language Pre-training

Alex Jinpeng Wang; Pan Zhou; Mike Zheng Shou; Shuicheng Yan

arXiv:2212.09737 · cs.CV · June 8, 2023

TL;DR

This paper introduces a position-guided text prompt (PTP) method to improve the visual grounding ability of vision-language pre-training models, leading to better performance on cross-modal tasks without relying on object detectors during inference.

Contribution

PTP reformulates visual grounding as a fill-in-the-blank task during pre-training, which strengthens VLP models' grounding ability. Because the object detector is used only in the pre-training phase, inference needs no detector.

Findings

01. Improves zero-shot Flickr30K retrieval (+4.8 average recall@1 for the ViLT baseline)

02. Improves COCO captioning (+5.3 CIDEr for the BLIP baseline)

03. Achieves results comparable to object-detector-based methods with faster inference, since no detector runs at inference time

Abstract

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N \times N$ blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g., filling "P" or "O" in a PTP "The block P has a…
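The block-and-prompt step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: blocks are numbered row-major starting from 0, an object is assigned to the block containing the center of its bounding box, the template is completed as "The block P has a O.", and the blank is written as "[MASK]". The function names block_index, make_prompt, and fill_in_the_blank are hypothetical.

```python
from typing import Tuple

# Bounding box as (x_min, y_min, x_max, y_max) in pixels (assumed format).
Box = Tuple[float, float, float, float]


def block_index(box: Box, width: int, height: int, n: int = 3) -> int:
    """Return the row-major index (0-based) of the N x N grid block that
    contains the center of the box. The numbering scheme is an assumption."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Clamp so that centers on or beyond the image border stay in the grid.
    col = min(max(int(cx * n / width), 0), n - 1)
    row = min(max(int(cy * n / height), 0), n - 1)
    return row * n + col


def make_prompt(label: str, box: Box, width: int, height: int, n: int = 3) -> str:
    """Build a position-guided text prompt for one detected object.
    The completed template wording is assumed for illustration."""
    return f"The block {block_index(box, width, height, n)} has a {label}."


def fill_in_the_blank(
    label: str,
    box: Box,
    width: int,
    height: int,
    n: int = 3,
    blank: str = "O",
    mask_token: str = "[MASK]",  # hypothetical mask token
) -> Tuple[str, str]:
    """Return (masked prompt, answer). blank="P" hides the block index,
    blank="O" hides the object name."""
    p = str(block_index(box, width, height, n))
    if blank == "P":
        return f"The block {mask_token} has a {label}.", p
    if blank == "O":
        return f"The block {p} has a {mask_token}.", label
    raise ValueError("blank must be 'P' or 'O'")


if __name__ == "__main__":
    # Hypothetical detection on a 600 x 480 image.
    print(make_prompt("ball", (500, 400, 600, 480), 600, 480))
    print(fill_in_the_blank("dog", (0, 0, 100, 100), 600, 480, blank="P"))
```

Run as a script, this prints `The block 8 has a ball.` followed by `('The block [MASK] has a dog.', '0')`. In practice the detections would come from the object detector used during pre-training; the sketch takes them as given.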

Code & Models

Repositories

sail-sg/ptp
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: SPEED (Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings) · BLIP (Bootstrapping Language-Image Pre-training) · ALIGN