Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Chenyu Yang; Xizhou Zhu; Jinguo Zhu; Weijie Su; Junjie Wang; Xuan Dong; Wenhai Wang; Lewei Lu; Bin Li; Jie Zhou; Yu Qiao; Jifeng Dai

arXiv:2406.07543 · cs.CV · December 23, 2024 · 1 citation

TL;DR

This paper introduces Latent Compression Learning (LCL), a pre-training method for vision models that exploits interleaved image-text data by maximizing the mutual information between the inputs and outputs of a causal attention model, with the aim of learning robust visual representations.

Contribution

The paper presents LCL, a pre-training approach for interleaved image-text data that is inspired by the recent success of compression learning in natural language processing.

Findings

1. LCL matches CLIP performance on paired datasets.

2. LCL effectively learns from interleaved image-text data.

3. The method demonstrates robustness in visual representation learning.

Abstract

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved image-text data, which is very prevalent on the Internet. Inspired by the recent success of compression learning in natural language processing, we propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data. This method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model. The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation. Our experiments demonstrate…
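The two-task decomposition described in the abstract can be illustrated with a toy loss. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (`contrastive_loss`, `generation_loss`, `lcl_style_loss`), the InfoNCE form of the contrastive term, the temperature, and the equal weighting of the two terms are all hypothetical choices made for illustration. It uses plain Python lists in place of real model outputs.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def contrastive_loss(context_states, image_latents, temperature=0.1):
    """Task 1 (assumed InfoNCE form): the i-th preceding-context state
    should score the i-th visual latent higher than the other latents
    in the batch."""
    ctx = [normalize(c) for c in context_states]
    img = [normalize(z) for z in image_latents]
    total = 0.0
    for i, c in enumerate(ctx):
        logits = [dot(c, z) / temperature for z in img]
        total -= log_softmax(logits)[i]
    return total / len(ctx)

def generation_loss(token_logits, target_ids):
    """Task 2: mean next-token cross-entropy for the text that follows
    the image, given per-position logits over the vocabulary."""
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)

def lcl_style_loss(context_states, image_latents, token_logits,
                   target_ids, weight=1.0):
    """Sum of the two terms; the weighting is an assumption."""
    return (contrastive_loss(context_states, image_latents)
            + weight * generation_loss(token_logits, target_ids))

if __name__ == "__main__":
    ctx = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical context states
    latents = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical visual latents
    logits = [[2.0, 0.5, 0.1]]          # one text position, vocabulary of 3
    print(lcl_style_loss(ctx, latents, logits, target_ids=[0]))
```

In this sketch the contrastive term falls when each context state is closest to its own image latent, and the generation term falls when the model assigns high probability to the correct next token, which mirrors the two tasks named in the abstract.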

Code & Models

Repositories

opengvlab/lcl (PyTorch, official)

Models

OpenGVLab/LCL-ViT-B-16-Laion (Hugging Face model, 3 downloads)

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training · Contrastive Learning