Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Yihe Deng; Pan Lu; Fan Yin; Ziniu Hu; Sheng Shen; Quanquan Gu; James Zou; Kai-Wei Chang; Wei Wang

arXiv:2405.19716 · cs.CV · November 26, 2024 · 1 citation

TL;DR

This paper introduces STIC, a self-training method for large vision language models that improves image comprehension by leveraging unlabeled images, yielding a 4.0% average performance gain across seven benchmarks while requiring 70% less supervised fine-tuning data.

Contribution

The paper proposes a novel self-training approach tailored for LVLMs, focusing on image comprehension and reducing reliance on labeled data.

Findings

01. Achieved a 4.0% average performance improvement across seven benchmarks.

02. Reduced the supervised fine-tuning data requirement by 70%.

03. Demonstrated the effectiveness of self-constructed datasets for model enhancement.

Abstract

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings, alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training remains a challenge given the unique visual perception and reasoning capabilities of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach designed specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images.…
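The first step named in the abstract, a self-constructed preference dataset of image descriptions, can be pictured with a minimal sketch. The abstract as shown here is truncated and does not say how preferred and dispreferred descriptions are produced, so the two generator callables below are hypothetical placeholders, as are the `PreferencePair` record, the `build_preference_dataset` helper, and the choice to skip pairs whose two descriptions are identical. This is an illustration under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PreferencePair:
    """One preference record for a single unlabeled image (hypothetical schema)."""
    image_id: str
    prompt: str
    preferred: str
    dispreferred: str


def build_preference_dataset(
    image_ids: Iterable[str],
    prompt: str,
    generate_preferred: Callable[[str, str], str],
    generate_dispreferred: Callable[[str, str], str],
) -> List[PreferencePair]:
    """Pair two model-generated descriptions per image.

    Both generators stand in for calls to the same LVLM; how they differ
    is an assumption left to the caller, since the excerpt does not say.
    """
    pairs: List[PreferencePair] = []
    for image_id in image_ids:
        good = generate_preferred(image_id, prompt)
        bad = generate_dispreferred(image_id, prompt)
        # Assumption: an identical pair carries no preference signal, so drop it.
        if good == bad:
            continue
        pairs.append(PreferencePair(image_id, prompt, good, bad))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins for model calls; a real pipeline would query the LVLM.
    def detailed(image_id: str, prompt: str) -> str:
        return f"A detailed description of {image_id}."

    def careless(image_id: str, prompt: str) -> str:
        return f"A vague description of {image_id}."

    dataset = build_preference_dataset(
        ["img_001", "img_002"], "Describe the image.", detailed, careless
    )
    for pair in dataset:
        print(pair)
```

Records in this shape (prompt, preferred response, dispreferred response) are what preference-based fine-tuning methods typically consume; no human labels are involved because both responses come from the model itself.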

Code & Models

Repositories

yihedeng9/stic
PyTorch · Official

Videos

Enhancing Large Vision Language Models with Self-Training on Image Comprehension · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications