Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang

TL;DR
This paper enhances contrastive vision-language pre-training by incorporating diverse multi-text data and multi-branch image encoders, leading to more interpretable and generalizable models that outperform traditional CLIP on various benchmarks.
Contribution
It introduces a holistic CLIP framework with multi-text generation and multi-to-multi contrastive learning, addressing biases and shallow visual expressivity in existing models.
Findings
Significant performance improvements on over ten benchmarks.
Enhanced interpretability and visual diversity in embeddings.
Outperforms existing CLIP variants in retrieval and classification tasks.
Abstract
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm, by updating both diverse data and alignment optimization. To obtain colorful data at low cost, we use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into multi-branch, and…
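The excerpt above does not spell out the training objective, so the following is only a minimal sketch of how a multi-to-multi contrastive loss over (image, multi-texts) pairs could look. It assumes that image branch m is paired with the m-th caption of each image and that the per-branch symmetric InfoNCE losses are simply averaged; the function names (`info_nce`, `multi_to_multi_loss`), the temperature value, and the pairing rule are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric one-to-one CLIP-style loss over one batch."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def multi_to_multi_loss(branch_embs, multi_text_embs, temperature=0.07):
    """Average the one-to-one loss over M (branch, caption) pairings.

    branch_embs[m][i]:     embedding of image i from image branch m
    multi_text_embs[m][i]: embedding of the m-th caption of image i
    Assumption: branch m is matched only with caption m.
    """
    losses = [
        info_nce(b, t, temperature)
        for b, t in zip(branch_embs, multi_text_embs)
    ]
    return sum(losses) / len(losses)


# Toy batch: 2 images, 2 branches, 2-dimensional embeddings.
eye = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = multi_to_multi_loss([eye, eye], [eye, eye])
misaligned_loss = multi_to_multi_loss([eye, eye], [eye, swapped])
```

In this toy batch, the loss is near zero when every branch embedding matches its caption and grows once the captions of one branch are swapped between images. With a single branch and a single caption per image, the sketch reduces to the one-to-one (image, text) paradigm that the abstract attributes to standard CLIP.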
Taxonomy
Topics: EFL/ESL Teaching and Learning
