Enhancing Vision-Language Model Pre-training with Image-text Pair Pruning Based on Word Frequency

Mingliang Liang; Martha Larson

arXiv:2410.10879·cs.LG·December 11, 2024

TL;DR

This paper introduces WFPP, a data pruning method that enhances vision-language model training by balancing word frequencies in image-text pairs, leading to improved downstream performance and training efficiency.

Contribution

WFPP is a novel pruning technique that removes image-text pairs whose text contains high-frequency words. It needs no metadata, and it improves both training efficiency and downstream performance.

Findings

01. Improves downstream task performance.

02. Speeds up pre-training with fewer samples.

03. Balances word frequency distribution in training data.

Abstract

We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data pruning method that improves the efficiency of VLMs. Unlike MetaCLIP, our method does not need metadata for pruning, but selects text-image pairs to prune based on the content of the text. Specifically, WFPP prunes text-image pairs containing high-frequency words across the entire training dataset. The effect of WFPP is to reduce the dominance of frequent words. The result is a better-balanced word-frequency distribution in the dataset, which is known to improve the training of word embedding models. After pre-training on the pruned subset, we fine-tuned the model on the entire dataset for one additional epoch to achieve better performance. Our experiments demonstrate that applying WFPP when training a CLIP model improves performance on a wide range of downstream tasks. WFPP also provides the advantage of speeding…
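
The pruning idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact scoring rule, so the code below makes several assumptions: it scores each caption by the mean of a word2vec-style discard probability, max(0, 1 - sqrt(t / f(w))), where f(w) is the relative frequency of word w in the whole caption set, and it keeps the lowest-scoring fraction of pairs. The names `tokenize`, `word_frequencies`, `caption_score`, `prune_pairs`, the threshold `t`, and `keep_ratio` are all hypothetical and are not taken from the paper or its repository.

```python
import math
from collections import Counter


def tokenize(caption):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return caption.lower().split()


def word_frequencies(captions):
    # Relative frequency of each word over the whole caption set.
    counts = Counter(w for c in captions for w in tokenize(c))
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def caption_score(caption, freqs, t=1e-3):
    # Assumed scoring rule (not confirmed by the abstract): the mean
    # word2vec-style discard probability over the caption's words.
    words = tokenize(caption)
    if not words:
        return 0.0
    probs = [max(0.0, 1.0 - math.sqrt(t / freqs[w])) for w in words]
    return sum(probs) / len(probs)


def prune_pairs(pairs, keep_ratio=0.5, t=1e-3):
    # pairs: list of (image_id, caption) tuples.
    # Keep the fraction of pairs whose captions are least dominated
    # by high-frequency words (lowest score first).
    freqs = word_frequencies([caption for _, caption in pairs])
    ranked = sorted(pairs, key=lambda p: caption_score(p[1], freqs, t))
    n_keep = max(1, round(len(pairs) * keep_ratio))
    return ranked[:n_keep]


pairs = [
    ("a", "a photo of a dog"),
    ("b", "a photo of a cat"),
    ("c", "rare axolotl swimming"),
    ("d", "a photo of a dog"),
]
kept = prune_pairs(pairs, keep_ratio=0.5)
print([image_id for image_id, _ in kept])
```

In this toy set the caption made only of rare words is ranked first, and the repeated "a photo of a dog" captions are the ones dropped. A deterministic ranking is used here only to keep the example reproducible; a sampling-based variant would be an equally plausible reading of the abstract.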

Code & Models

Repositories

MingliangLiang3/WFPP
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Pruning · Contrastive Language-Image Pre-training