Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Zou, Kai-Wei Chang, Wei Wang

TL;DR
This paper introduces STIC, a self-training method for large vision language models that enhances image comprehension by leveraging unlabeled images, resulting in significant performance improvements with less supervised data.
Contribution
The paper proposes a novel self-training approach tailored for LVLMs, focusing on image comprehension and reducing reliance on labeled data.
Findings
Achieved 4.0% average performance improvement across seven benchmarks.
Reduced supervised fine-tuning data requirement by 70%.
Demonstrated effectiveness of self-constructed datasets for model enhancement.
Abstract
Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings, alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training remains a challenge given the unique visual perception and reasoning capabilities of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach designed specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images.…
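The first step named in the abstract, a preference dataset of image descriptions built from unlabeled images, can be pictured with a minimal sketch. The text shown here does not say how the preferred and dispreferred descriptions are produced, so the two callables below (`describe_preferred`, `describe_dispreferred`), the `PreferencePair` record, and the filtering rule are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PreferencePair:
    """One training example: two descriptions of the same unlabeled image."""
    image_id: str
    chosen: str    # description treated as preferred
    rejected: str  # description treated as dispreferred


def build_preference_dataset(
    image_ids: Iterable[str],
    describe_preferred: Callable[[str], str],     # hypothetical: model call yielding the better description
    describe_dispreferred: Callable[[str], str],  # hypothetical: model call yielding the worse description
) -> List[PreferencePair]:
    """Pair two model-generated descriptions per image; no human labels are used."""
    pairs: List[PreferencePair] = []
    for image_id in image_ids:
        chosen = describe_preferred(image_id)
        rejected = describe_dispreferred(image_id)
        # Assumed filter: drop empty or identical pairs, which carry no preference signal.
        if chosen.strip() and chosen != rejected:
            pairs.append(PreferencePair(image_id, chosen, rejected))
    return pairs
```

In practice both callables would wrap the same LVLM under different generation conditions; stub functions are enough to check the pairing logic.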
Taxonomy
Topics: Multimodal Machine Learning Applications
