TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Chaoya Jiang, Haiyang Xu, Chenliang Li, Ming Yan, Wei Ye, Shikun Zhang, Bin Bi, Songfang Huang

TL;DR
TRIPS introduces a text-guided image patch selection method in vision-and-language pre-training that significantly improves computational efficiency without sacrificing performance.
Contribution
The paper proposes a novel patch-selection layer that dynamically identifies relevant image patches guided by text, reducing computation in ViTs without adding extra parameters.
Findings
Achieves 40% speedup over previous models.
Maintains or improves downstream task performance.
Demonstrates effectiveness across multiple benchmark datasets.
Abstract
Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from computational inefficiency caused by the long visual sequence. To tackle this problem, we propose an efficient vision-and-language pre-training model with Text-Relevant Image Patch Selection, namely TRIPS, which reduces the visual sequence progressively with a text-guided patch-selection layer in the visual backbone for efficient training and inference. The patch-selection layer dynamically computes text-dependent visual attention to identify the attentive image tokens with text guidance and fuses the inattentive ones in an end-to-end manner. Meanwhile, TRIPS does not introduce extra parameters to ViTs. Experimental results on a…
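The patch-selection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single pooled text vector, scores patches with a raw scaled dot product (the real layer may reuse the ViT's own attention projections), and fuses the dropped patches by an attention-weighted average. The function name select_patches and the keep_ratio parameter are hypothetical.

```python
import numpy as np


def select_patches(patches, text_vec, keep_ratio=0.5):
    """Keep the most text-relevant patches and fuse the rest into one token.

    patches:  (N, D) array of image patch tokens (no [CLS] token).
    text_vec: (D,) pooled text feature used as the query (an assumption).
    Returns (tokens, keep_idx), where tokens has k rows, plus one fused row
    if any patches were dropped.
    """
    n, d = patches.shape
    k = max(1, int(np.ceil(keep_ratio * n)))

    # Text-dependent attention over patches: softmax of scaled dot products.
    # No learned weights are used here, so no extra parameters are added.
    scores = patches @ text_vec / np.sqrt(d)
    scores = scores - scores.max()  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum()

    # Rank patches by attention; keep the top k in their original order.
    order = np.argsort(-attn, kind="stable")
    keep_idx = np.sort(order[:k])
    drop_idx = order[k:]
    kept = patches[keep_idx]

    if drop_idx.size == 0:
        return kept, keep_idx

    # Fuse inattentive patches into a single token, weighted by attention.
    w = attn[drop_idx]
    fused = (w[:, None] * patches[drop_idx]).sum(axis=0) / w.sum()
    return np.vstack([kept, fused[None, :]]), keep_idx
```

With 16 input patches and keep_ratio=0.5, the function returns 9 tokens: the 8 retained patches plus one fused token. Applying such a step at several backbone layers would shorten the visual sequence progressively, which is where the efficiency gain comes from.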
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: InfoNCE · Contrastive Learning
