TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Chaoya Jiang; Haiyang Xu; Chenliang Li; Miang Yan; Wei Ye; Shikun Zhang; Bin Bi; Songfang Huang

arXiv:2305.04474·cs.CV·September 30, 2025·2 cites

TL;DR

TRIPS introduces a text-guided image patch selection method in vision-and-language pre-training that significantly improves computational efficiency without sacrificing performance.

Contribution

The paper proposes a novel patch-selection layer that dynamically identifies relevant image patches guided by text, reducing computation in ViTs without adding extra parameters.

Findings

01 Achieves a 40% speedup over previous similar VLP models.

02 Maintains or improves downstream task performance relative to those models.

03 Demonstrates these results across multiple benchmark datasets.

Abstract

Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP work has demonstrated the effectiveness of ViTs, these models still suffer from computational inefficiency caused by the long visual sequence. To tackle this problem, we propose an efficient vision-and-language pre-training model with Text-Relevant Image Patch Selection, namely TRIPS, which progressively reduces the visual sequence with a text-guided patch-selection layer in the visual backbone for efficient training and inference. The patch-selection layer dynamically computes text-dependent visual attention to identify the attentive image tokens with text guidance and fuses the inattentive ones in an end-to-end manner. Meanwhile, TRIPS does not introduce extra parameters to ViTs. Experimental results on a…
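The select-and-fuse idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `select_and_fuse_patches`, the `keep_ratio` parameter, the use of a single pooled text vector, and the plain scaled dot-product scoring without learned projections are all assumptions made for illustration. The sketch scores each patch against the text vector, keeps the top-scoring patches, and merges the rest into one attention-weighted token, so no new parameters are involved.

```python
import numpy as np


def select_and_fuse_patches(patches, text_feat, keep_ratio=0.5):
    """Hypothetical sketch of text-guided patch selection.

    patches:    (N, D) array of image patch tokens.
    text_feat:  (D,) pooled text embedding (assumed to share the patch dimension).
    keep_ratio: assumed fraction of patches to keep.

    Returns (tokens, keep_idx): tokens holds the kept patches in their
    original order, followed by one fused token if any patch was dropped.
    """
    n, d = patches.shape

    # Text-dependent attention over patches: scaled dot product + softmax.
    scores = patches @ text_feat / np.sqrt(d)
    scores = scores - scores.max()          # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum()

    # Keep the k most text-relevant ("attentive") patches.
    k = max(1, int(np.ceil(keep_ratio * n)))
    order = np.argsort(-attn, kind="stable")
    keep_idx = np.sort(order[:k])           # preserve spatial order
    drop_idx = order[k:]
    kept = patches[keep_idx]

    if drop_idx.size == 0:
        return kept, keep_idx

    # Fuse the "inattentive" patches into a single token, weighted by
    # their renormalized attention scores.
    w = attn[drop_idx]
    w = w / w.sum()
    fused = (w[:, None] * patches[drop_idx]).sum(axis=0, keepdims=True)

    return np.concatenate([kept, fused], axis=0), keep_idx


# Usage: 16 patches of dimension 8, keeping half of them.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
text_feat = rng.normal(size=8)
tokens, keep_idx = select_and_fuse_patches(patches, text_feat, keep_ratio=0.5)
print(tokens.shape)  # (9, 8): 8 kept patches plus 1 fused token
```

Applying such a step at several depths of the visual backbone would shorten the sequence progressively, which is where the savings in later layers would come from under these assumptions.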

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: InfoNCE · Contrastive Learning