COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment

Chaoya Jiang; Haiyang Xu; Wei Ye; Qinghao Ye; Chenliang Li; Ming Yan; Bin Bi; Shikun Zhang; Ji Zhang; Fei Huang

arXiv:2308.03475 · cs.MM · February 27, 2024 · 1 citation

TL;DR

This paper presents COPA, a vision-language pre-training method that integrates object information into ViT models through a novel patch-text alignment task, significantly speeding up inference while maintaining high performance.

Contribution

It introduces a patch-text alignment mechanism that converts object-level signals into patch-level cues, enabling efficient, fine-grained cross-modal learning without running an expensive object detector.

Findings

01. Speeds up inference by 88% over previous models.

02. Maintains or improves downstream task performance.

03. Uses object annotations for only 5% of the training images.

Abstract

Vision-Language Pre-training (VLP) methods based on object detection enjoy the rich knowledge of fine-grained object-text alignment, but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches circumvent this issue, yet they struggle with long visual sequences and lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. By using off-the-shelf delicate object annotations in 5% of training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner, bypassing the high computational cost of object detection and yielding an…
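
To make the "object-level to patch-level" conversion concrete, here is a minimal sketch of one way bounding-box annotations could be turned into per-patch binary labels on a ViT patch grid. The abstract does not state the exact rule, so everything specific here is an assumption for illustration: the function name `boxes_to_patch_labels`, the 224-pixel image and 16-pixel patch defaults, and the `min_overlap` threshold (a patch counts as positive when a single box covers at least that fraction of its area). It is not the authors' implementation.

```python
def boxes_to_patch_labels(boxes, image_size=224, patch_size=16, min_overlap=0.5):
    """Map pixel-space boxes to a grid of 0/1 patch labels (illustrative only).

    boxes: iterable of (x0, y0, x1, y1) in pixels, for the objects that the
           paired text mentions.
    A patch is labelled 1 when a single box covers at least `min_overlap`
    of the patch area. This threshold rule is an assumption, not taken
    from the paper.
    """
    grid = image_size // patch_size
    patch_area = float(patch_size * patch_size)
    labels = [[0] * grid for _ in range(grid)]
    for row in range(grid):
        for col in range(grid):
            # Pixel extent of this patch.
            px0, py0 = col * patch_size, row * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            for x0, y0, x1, y1 in boxes:
                # Width and height of the box/patch intersection.
                w = min(px1, x1) - max(px0, x0)
                h = min(py1, y1) - max(py0, y0)
                if w > 0 and h > 0 and (w * h) / patch_area >= min_overlap:
                    labels[row][col] = 1
                    break
    return labels


# Tiny example: a 64x64 image split into a 4x4 grid of 16-pixel patches,
# with one box covering the four central patches.
example = boxes_to_patch_labels([(16, 16, 48, 48)], image_size=64, patch_size=16)
for row in example:
    print(row)
```

Labels of this shape could serve as targets for a per-patch binary classifier conditioned on the text, which is one plausible reading of a "text-aware patch detector"; how COPA actually builds and uses the targets is described in the paper itself.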

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications