TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

Dawei Yan; Pengcheng Li; Yang Li; Hao Chen; Qingguo Chen; Weihua Luo; Wei Dong; Qingsen Yan; Haokui Zhang; Chunhua Shen

arXiv:2409.09564·cs.CV·September 23, 2024

TL;DR

TG-LLaVA introduces a novel approach to improve vision-language models by guiding the vision encoder with text through learnable latent embeddings, enhancing feature extraction and answer accuracy without extra training data.

Contribution

The paper proposes a new text-guided optimization method for vision encoders using learnable latent embeddings, offering an orthogonal improvement to existing VLMs.

Findings

1. Improves baseline performance without additional training data.

2. Enhances the vision encoder's ability to extract text-related features.

3. Consistently outperforms other methods across various datasets and settings.

Abstract

Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local…
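
The guidance step described in the abstract (latent embeddings read the instruction, and the result is added to the vision features) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the attention here has no learned projections, the second attention step that distributes guidance to patches is an assumption, and names such as `text_guided_features` and `gate` are hypothetical. Random arrays stand in for real encoder outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification for illustration): each query row becomes a weighted
    # average of the context rows.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


def text_guided_features(vision_feats, text_feats, latents, gate):
    # Step 1: the latent embeddings "analyze" the instruction by attending
    # to the text token features.
    guidance = cross_attention(latents, text_feats)      # (n_latents, d)
    # Step 2 (assumption): each image patch gathers the guidance that is
    # most relevant to it from the latents.
    per_patch = cross_attention(vision_feats, guidance)  # (n_patches, d)
    # Step 3: add the guidance to the vision features as a residual.
    # `gate` is a hypothetical scalar; gate=0 leaves the features unchanged.
    return vision_feats + gate * per_patch


rng = np.random.default_rng(0)
d = 16
vision_feats = rng.normal(size=(9, d))  # stand-in for 9 image patch features
text_feats = rng.normal(size=(5, d))    # stand-in for 5 instruction tokens
latents = rng.normal(size=(4, d))       # stand-in for 4 latent embeddings

refined = text_guided_features(vision_feats, text_feats, latents, gate=0.1)
unchanged = text_guided_features(vision_feats, text_feats, latents, gate=0.0)
```

With `gate=0.0` the function returns the original vision features, which shows why an additive, residual form of guidance can be attached to an existing encoder without disturbing it at the start of training. In a real model the latents and any projections would be learned parameters.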

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies

Methods: Sparse Evolutionary Training · Focus