Let ViT Speak: Generative Language-Image Pre-training

Yan Fang; Mengcheng Lan; Zilong Huang; Weixian Lei; Yunqing Zhao; Yujie Zhong; Yingchen Yu; Qi She; Yao Zhao; Yunchao Wei

arXiv:2605.00809 · cs.CV · May 4, 2026

TL;DR

GenLIP is a simple, scalable generative pretraining framework for Vision Transformers. It trains the vision encoder with a standard language modeling objective so that it aligns with the autoregressive LLMs used in multimodal large language models.

Contribution

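The paper proposes a minimalist autoregressive pretraining method for ViTs: a single transformer predicts language tokens directly from visual tokens, with no contrastive batch construction and no additional text decoder. The resulting encoders improve multimodal model performance while using less pretraining data.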

Findings

01  Achieves competitive results on multimodal benchmarks.

02  Matches or surpasses baselines with less pretraining data.

03  Improves OCR and chart understanding after further training.

Abstract

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses…
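The objective described in the abstract (one transformer, visual tokens followed by text tokens, a standard next-token loss on the text) can be illustrated with a small sketch. This is not the authors' implementation. It assumes that visual tokens come first in the sequence, that visual tokens attend to each other bidirectionally while text tokens attend causally (a prefix-LM mask, which the abstract does not specify), and that the loss is averaged over text positions only. The function names `prefix_lm_mask` and `text_lm_loss` are hypothetical.

```python
import math

def prefix_lm_mask(num_visual, num_text):
    """Boolean mask[i][j]: may query position i attend to key position j?

    Assumption (not stated in the abstract): visual tokens see all visual
    tokens; text tokens see all visual tokens plus text up to themselves.
    """
    total = num_visual + num_text
    mask = []
    for i in range(total):
        if i < num_visual:
            row = [j < num_visual for j in range(total)]
        else:
            row = [j <= i for j in range(total)]
        mask.append(row)
    return mask

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def text_lm_loss(logits, text_ids, num_visual):
    """Mean next-token cross-entropy over the text tokens only.

    logits: one list of vocabulary scores per sequence position
            (visual positions first, then text positions).
    The last visual position predicts the first text token; each text
    position predicts the following text token. Requires num_visual >= 1.
    """
    total = 0.0
    for k, target in enumerate(text_ids):
        pos = num_visual - 1 + k  # position whose output predicts text_ids[k]
        total -= log_softmax(logits[pos])[target]
    return total / len(text_ids)

if __name__ == "__main__":
    # Toy setup: 2 visual tokens, 2 text tokens, vocabulary of 4 tokens.
    mask = prefix_lm_mask(num_visual=2, num_text=2)
    for row in mask:
        print(["x" if allowed else "." for allowed in row])

    # Uniform logits stand in for a model's output.
    uniform_logits = [[0.0] * 4 for _ in range(4)]
    print(text_lm_loss(uniform_logits, text_ids=[1, 3], num_visual=2))
```

With uniform logits the loss equals the log of the vocabulary size, which is a quick sanity check for any real implementation. In an actual training loop the logits would come from the transformer run under the attention mask, and the choice of mask is a design decision that this sketch only guesses at.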

Code & Models

Repositories

yanfangcs/GenLIP (GitHub)
