Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

Xuhui Zhan; Tyler Derr

arXiv:2508.12466·cs.CV·August 19, 2025

TL;DR

Inverse-LLaVA introduces a multimodal learning approach that eliminates alignment pre-training by mapping text embeddings into the visual representation space, enabling effective reasoning without large image-text alignment datasets.

Contribution

The paper proposes a paradigm that inverts the conventional vision-to-text mapping direction, reducing computational cost and challenging the necessity of alignment pre-training in multimodal models.

Findings

01 Improves performance on reasoning tasks by 0.2% to 27.2%.

02 Decreases performance on perception tasks, e.g., celebrity recognition drops by 49.5%.

03 Reduces computational requirements by 45%.

Abstract

Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs:…
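
The mapping direction described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `fuse_text_with_vision`, the projection matrices `w_t2v` and `w_out`, the scalar `gate`, and all dimensions are assumptions made for illustration. The sketch only shows the general idea of projecting text embeddings into a continuous visual space and adding a visually conditioned term to a text attention output.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def fuse_text_with_vision(text_emb, vis_feats, w_t2v, w_out, gate):
    """Hypothetical fusion step (an assumption, not the paper's exact layer).

    text_emb : (n_text, d_text) text token embeddings
    vis_feats: (n_vis, d_vis)   continuous visual features
    w_t2v    : (d_text, d_vis)  text-to-vision projection (assumed linear)
    w_out    : (d_vis, d_text)  projection back to the model width (assumed)
    gate     : scalar weight on the additive visual component (assumed)
    """
    # Ordinary text self-attention in the text space.
    text_out = attention(text_emb, text_emb, text_emb)
    # Map text embeddings INTO the visual space (the inverted direction).
    text_in_vis = text_emb @ w_t2v
    # Let the projected text queries attend over the visual features.
    vis_ctx = attention(text_in_vis, vis_feats, vis_feats)
    # Add the visual context as an extra term on top of the text output.
    return text_out + gate * (vis_ctx @ w_out)


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))     # 4 text tokens, width 8
vis_feats = rng.normal(size=(6, 16))   # 6 visual patches, width 16
w_t2v = rng.normal(size=(8, 16)) * 0.1
w_out = rng.normal(size=(16, 8)) * 0.1

fused = fuse_text_with_vision(text_emb, vis_feats, w_t2v, w_out, gate=1.0)
text_only = fuse_text_with_vision(text_emb, vis_feats, w_t2v, w_out, gate=0.0)
```

With `gate=0.0` the additive term vanishes and the result reduces to text-only attention, which is one way to read "selective additive components": the visual signal is an add-on to the language model's own computation rather than a set of pseudo-tokens in its input.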

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling