TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen

TL;DR
TG-LLaVA introduces a novel approach to improve vision-language models by guiding the vision encoder with text through learnable latent embeddings, enhancing feature extraction and answer accuracy without extra training data.
Contribution
The paper proposes a new text-guided optimization method for vision encoders using learnable latent embeddings, offering an orthogonal improvement to existing VLMs.
Findings
Improves baseline performance without additional training data.
Enhances the vision encoder's ability to extract text-related features.
Consistently outperforms other methods across various datasets and settings.
Abstract
Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local…
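The text-guidance step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes single-head dot-product attention without learned projections, random arrays in place of real text and image features, and a plain residual addition as the way guidance is injected. All names (`cross_attention`, `latents`, `text_guidance`, `guided_patches`) and all sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention.

    Simplification (assumption): no learned query/key/value projections.
    Each query row returns a weighted average of the context rows.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


rng = np.random.default_rng(0)
dim, num_latents, num_text_tokens, num_patches = 16, 4, 7, 9

# Stand-ins for real tensors (assumption: random values, shared width `dim`).
latents = rng.normal(size=(num_latents, dim))          # would be learnable parameters
text_tokens = rng.normal(size=(num_text_tokens, dim))  # embedded instruction tokens
patch_features = rng.normal(size=(num_patches, dim))   # vision-encoder patch features

# Step 1: the latent embeddings "read" the instruction by attending to its tokens.
text_guidance = cross_attention(latents, text_tokens)

# Step 2: the result is added to the visual features as guidance.
# Assumption: patches attend to the guidance and the output is added residually.
guided_patches = patch_features + cross_attention(patch_features, text_guidance)

print(text_guidance.shape, guided_patches.shape)  # (4, 16) (9, 16)
```

The point of the latent bottleneck in this sketch is that the guidance has a fixed size (`num_latents` rows) regardless of instruction length, so the visual features interact with a compact summary of the text rather than with every token.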
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training · Focus
