Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
Xuhui Zhan, Tyler Derr

TL;DR
Inverse-LLaVA introduces a novel multimodal learning approach that eliminates the need for alignment pre-training by mapping text embeddings into visual space, enabling effective reasoning without large image-text datasets.
Contribution
It proposes a new paradigm that inverts the conventional vision-to-text mapping direction, reducing computational cost and challenging the assumption that alignment pre-training is necessary in multimodal models.
Findings
Improves performance on reasoning tasks by 0.2% to 27.2%.
Decreases performance on perception tasks, e.g., celebrity recognition drops by 49.5%.
Reduces computational requirements by 45%.
Abstract
Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs:…
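The mechanism described in the abstract (text embeddings mapped into the visual representation space, then fused through an additive term in attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `text_to_vision_fusion`, the projection matrices `w_t2v` and `w_out`, and the choice of a single-head cross-attention followed by a residual add are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def text_to_vision_fusion(text_hidden, visual_feats, w_t2v, w_out):
    """Hypothetical additive fusion step (illustrative, not from the paper).

    text_hidden:  (T, d_text) text hidden states at some intermediate layer
    visual_feats: (N, d_vis)  continuous visual features, left unprojected
    w_t2v:        (d_text, d_vis) assumed text-to-vision projection
    w_out:        (d_vis, d_text) assumed projection back to the text width
    """
    # Map text states INTO the visual space (the "inverse" direction).
    text_in_vis = text_hidden @ w_t2v                      # (T, d_vis)

    # Scaled dot-product attention of text queries over visual features.
    scale = np.sqrt(visual_feats.shape[-1])
    weights = softmax(text_in_vis @ visual_feats.T / scale)  # (T, N)
    visual_context = weights @ visual_feats                # (T, d_vis)

    # Additive component: the visual context is added to the text stream.
    fused = text_hidden + visual_context @ w_out           # (T, d_text)
    return fused, weights


# Small usage example with random data (shapes are arbitrary assumptions).
rng = np.random.default_rng(0)
text_hidden = rng.normal(size=(5, 16))    # 5 text tokens, width 16
visual_feats = rng.normal(size=(9, 32))   # 9 visual patches, width 32
w_t2v = rng.normal(size=(16, 32)) * 0.1
w_out = rng.normal(size=(32, 16)) * 0.1

fused, weights = text_to_vision_fusion(text_hidden, visual_feats, w_t2v, w_out)
print(fused.shape, weights.shape)
```

In this sketch the visual features are never projected into the text token space; only the text side is mapped, which is the inversion the abstract describes. Setting `w_out` to zeros recovers the unmodified text hidden states, which is one way an additive component could be made selective.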
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling
