Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs

Selim Kuzucu; Muhammad Ferjad Naeem; Anna Kukleva; Federico Tombari; Bernt Schiele

arXiv:2507.00754·cs.CV·July 10, 2025

TL;DR

LUViT introduces a joint pre-training strategy that effectively integrates LLMs with Vision Transformers, enhancing visual understanding by bridging modality gaps and leveraging LLMs' semantic knowledge.

Contribution

The paper proposes LUViT, a pre-training approach that co-adapts ViTs and LLMs using Masked Auto-Encoding (MAE) and Low-Rank Adaptation (LoRA), addressing the modality mismatch between the two and improving performance on vision tasks.
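To make the LoRA half of this concrete, here is a minimal sketch of a LoRA-adapted linear layer in plain Python. It is not the paper's implementation: the class name `LoRALinear`, the helper `matvec`, the rank, the scaling, and the initialisation are illustrative assumptions that follow the common LoRA recipe (frozen weight plus a trainable low-rank update), and which LLM weights LUViT actually adapts is not stated in the text above.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical layer: y = W x + (alpha / rank) * B (A x).

    W stays frozen; only the low-rank factors A and B would be trained.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        # A gets a small random init, B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)               # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Count of parameters in A and B (the only trainable ones)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear([[1, 2], [3, 4]], rank=1)
output = layer.forward([1, 1])
```

Because `B` is zero at initialisation, the adapted layer starts out identical to the frozen one, which is one reason low-rank adaptation tends to be a gentle way to fine-tune a pretrained block.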

Findings

01. LUViT outperforms existing methods on multiple vision benchmarks.

02. Joint pre-training enhances the LLM's ability to interpret visual data.

03. The approach achieves more stable and efficient fine-tuning.

Abstract

The integration of Large Language Model (LLM) blocks with Vision Transformers (ViTs) holds immense promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between the text-centric pre-training of LLMs and the vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable fine-tuning. As a result, LLM blocks are kept frozen while only the vision components are learned. As a remedy to these challenges, we introduce Language-Unlocked Vision Transformers (LUViT), a novel approach that bridges this modality mismatch through a synergistic pre-training strategy. LUViT co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2)…
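Step (1) of the abstract relies on Masked Auto-Encoding. As a rough illustration of the generic MAE recipe (not LUViT's actual code), the sketch below randomly hides a fraction of image patches and scores a reconstruction only on the hidden ones. The function names `random_patch_mask` and `masked_mse`, the 196-patch grid, and the 0.75 mask ratio are assumptions chosen for the example; the listing above does not give LUViT's settings.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets, uniformly at random."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])  # patches the encoder would see
    masked = sorted(order[num_keep:])   # patches the decoder must reconstruct
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared error computed over masked patches only."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Toy usage: 196 patches (a 14 x 14 grid), each flattened to 4 values.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
target = [[1.0] * 4 for _ in range(196)]
pred = [[0.0] * 4 for _ in range(196)]
loss = masked_mse(pred, target, masked)
```

Restricting the loss to masked patches means the model gets no credit for copying what it was shown, so it has to infer the hidden content from the visible context.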

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications