HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

Zhixiang Wei; Guangting Wang; Xiaoxiao Ma; Ke Mei; Huaian Chen; Yi Jin; Fengyun Rao

arXiv:2507.22431·cs.CV·July 31, 2025


TL;DR

This paper introduces HQ-CLIP, which uses large vision-language models (LVLMs) to refine image-text datasets and train improved CLIP models, reporting state-of-the-art results with less data.

Contribution

The work presents a novel LVLM-driven data refinement pipeline and a training paradigm that enhances CLIP performance by incorporating multi-grained annotations and negative descriptions.

Findings

01. HQ-CLIP outperforms standard CLIP on multiple benchmarks.

02. The refined dataset VLM-150M improves model training.

03. HQ-CLIP achieves state-of-the-art zero-shot classification and retrieval results.

Abstract

Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a…
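As a rough illustration of the four text types named in the abstract, the record below shows one way a refined sample could be represented and parsed. This is a minimal sketch, not the authors' implementation: the class name `RefinedAnnotation`, the helpers `build_prompt` and `parse_response`, the JSON key names, and the prompt wording are all hypothetical, and the paper's actual prompt and output format are not given here.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class RefinedAnnotation:
    """Hypothetical container for the four text types named in the abstract."""
    long_positive: str              # long description of what the image shows
    long_negative: str              # long description that does NOT match the image
    short_positive_tags: List[str]  # short tags that apply to the image
    short_negative_tags: List[str]  # short tags that do not apply to the image


def build_prompt(alt_text: str) -> str:
    """Assemble an illustrative instruction (assumed wording, not the paper's prompt)."""
    return (
        "Given the image and its raw alt-text, return JSON with the keys "
        "long_positive, long_negative, short_positive_tags, short_negative_tags.\n"
        f"Alt-text: {alt_text}"
    )


def parse_response(raw: str) -> RefinedAnnotation:
    """Parse a JSON string (assumed output format) into a RefinedAnnotation."""
    data = json.loads(raw)
    return RefinedAnnotation(
        long_positive=str(data["long_positive"]).strip(),
        long_negative=str(data["long_negative"]).strip(),
        short_positive_tags=[str(t).strip() for t in data["short_positive_tags"]],
        short_negative_tags=[str(t).strip() for t in data["short_negative_tags"]],
    )
```

The call to an actual LVLM is left out on purpose: any model that accepts an image plus the prompt string and returns JSON text could sit between `build_prompt` and `parse_response`.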
