COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment
Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Ji Zhang, Fei Huang

TL;DR
This paper presents COPA, a vision-language pre-training method that efficiently integrates object information into ViT models through a novel patch-text alignment, significantly speeding up inference while maintaining high performance.
Contribution
It introduces a patch-text alignment mechanism that converts object signals into patch-level cues, enabling efficient and detailed cross-modal learning without expensive object detection.
Findings
Achieves 88% speedup over previous models.
Maintains or improves downstream task performance.
Uses object annotations on only 5% of the training images.
Abstract
Vision-Language Pre-training (VLP) methods based on object detection enjoy the rich knowledge of fine-grained object-text alignment, but at the cost of computationally expensive inference. Recent Visual-Transformer (ViT)-based approaches circumvent this issue, yet struggle with long visual sequences that lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. By using off-the-shelf delicate object annotations in 5% of the training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner, bypassing the high computational cost of object detection and yielding an…
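The conversion of object-level signals into patch-level ones can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `box_to_patch_labels`, the 224-pixel image, the 16-pixel patch, and the rule "a patch is positive when at least half of its area lies inside the box" are all assumptions chosen for illustration.

```python
def box_to_patch_labels(box, image_size=224, patch_size=16, min_overlap=0.5):
    """Return a flat, row-major list of 0/1 labels, one per ViT patch.

    box is (x0, y0, x1, y1) in pixels. A patch gets label 1 when the
    fraction of its area covered by the box is at least min_overlap.
    The threshold rule is an illustrative assumption, not the paper's.
    """
    x0, y0, x1, y1 = box
    grid = image_size // patch_size          # patches per side
    patch_area = patch_size * patch_size
    labels = []
    for row in range(grid):
        for col in range(grid):
            px0, py0 = col * patch_size, row * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            # Width and height of the box/patch intersection, clamped at zero.
            iw = max(0, min(x1, px1) - max(x0, px0))
            ih = max(0, min(y1, py1) - max(y0, py0))
            covered = (iw * ih) / patch_area
            labels.append(1 if covered >= min_overlap else 0)
    return labels


# Hypothetical object box covering columns 2-5 and rows 2-3 of the patch grid.
labels = box_to_patch_labels((32, 32, 96, 64))
```

Labels of this kind could serve as binary targets for a per-patch classifier conditioned on the text, which is one plausible reading of the "text-aware patch detector" described in the abstract.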
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
