Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Haicheng Wang; Chen Ju; Weixiong Lin; Shuai Xiao; Mengting Chen; Yixuan Huang; Chang Liu; Mingshuai Yao; Jinsong Lan; Ying Chen; Qingwen Liu; Yanfeng Wang

arXiv:2412.00440·cs.CV·December 3, 2024

TL;DR

This paper enhances contrastive vision-language pre-training by incorporating diverse multi-text data and multi-branch image encoders, leading to more interpretable and generalizable models that outperform traditional CLIP on various benchmarks.

Contribution

The paper introduces a holistic CLIP framework that combines multi-text generation with multi-to-multi contrastive learning, addressing the bias towards monotonous short texts and the shallow visual expressivity of existing models.

Findings

01. Significant performance improvements on more than ten benchmarks.

02. Enhanced interpretability and visual diversity in embeddings.

03. Outperforms existing CLIP variants in retrieval and classification tasks.

Abstract

In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, by relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale, messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm by updating both the data, for greater diversity, and the alignment optimization. To obtain diverse data at low cost, we use image-to-text captioning to generate multiple texts for each image, from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into multi-branch, and…
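The contrast between the one-to-one paradigm and a multi-to-multi objective can be made concrete with a small sketch. The abstract is truncated before it specifies how branches and texts are matched, so the code below rests on an assumption: image branch m is paired with text m of the same image, and a symmetric InfoNCE loss is averaged over branches. The function names (`info_nce`, `multi_to_multi_loss`) and the temperature of 0.07 are illustrative choices, not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature):
    """Mean cross-entropy over a batch where queries[i] should match keys[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def multi_to_multi_loss(image_branches, texts, temperature=0.07):
    """Average a symmetric contrastive loss over M branch/text pairs.

    image_branches[b][m]: embedding from image branch m for sample b.
    texts[b][m]: embedding of the m-th generated text for sample b.
    ASSUMPTION (not from the source): branch m is matched to text m.
    With M = 1 this reduces to the usual one-to-one CLIP objective.
    """
    num_branches = len(image_branches[0])
    loss = 0.0
    for m in range(num_branches):
        img = [l2_normalize(sample[m]) for sample in image_branches]
        txt = [l2_normalize(sample[m]) for sample in texts]
        image_to_text = info_nce(img, txt, temperature)
        text_to_image = info_nce(txt, img, temperature)
        loss += 0.5 * (image_to_text + text_to_image)
    return loss / num_branches
```

Calling `multi_to_multi_loss` with one embedding per image and per text recovers the standard CLIP loss, which makes it easy to compare the two paradigms on the same batch. The paper's actual objective may weight or match branches differently.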

Taxonomy

Topics: EFL/ESL Teaching and Learning