Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

Zikang Liu; Kun Zhou; Wayne Xin Zhao; Dawei Gao; Yaliang Li; Ji-Rong Wen

arXiv:2502.11427·cs.CL·February 18, 2025

Open Access

TL;DR

This paper introduces ViFT, a framework that fine-tunes large vision-language models without visual instructions. It trains only on text-only instructions and image captions, and reports state-of-the-art results on several visual reasoning benchmarks while using less training data.

Contribution

ViFT enables visual instruction-free fine-tuning of LVLMs, reducing data requirements and inheriting task-solving capabilities from backbone LLMs.

Findings

01 Achieves state-of-the-art performance on several visual reasoning benchmarks.

02 Requires less training data than visual instruction tuning.

03 Combines text and image representations at inference time to fuse task-solving and visual perception abilities.

Abstract

Visual instruction tuning has become the predominant technique for eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite its success, visual instructions require images as input, which leaves a gap in inheriting the task-solving capabilities of the backbone LLMs and makes it costly to collect a large-scale dataset. To address this, we propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. ViFT requires only text-only instructions and image caption data during training, which it uses to learn the task-solving and visual perception abilities separately. During inference, we extract and combine the representations of the text and image inputs, fusing the two abilities to fulfill multimodal tasks. Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and…
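
The abstract says that representations of the text and image inputs are extracted and combined at inference, but this excerpt does not specify how. As a minimal sketch only, the snippet below assumes the combination is a weighted sum of two same-sized vectors. The function name `combine_representations` and the parameter `text_weight` are hypothetical and are not taken from the paper.

```python
from typing import List


def combine_representations(
    text_repr: List[float],
    image_repr: List[float],
    text_weight: float = 0.5,
) -> List[float]:
    """Blend two representation vectors with a convex combination.

    Assumption: a weighted sum stands in for the paper's fusion step,
    which the abstract does not describe in detail.
    """
    if len(text_repr) != len(image_repr):
        raise ValueError("representations must have the same length")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")

    image_weight = 1.0 - text_weight
    # Element-wise weighted sum of the two vectors.
    return [
        text_weight * t + image_weight * v
        for t, v in zip(text_repr, image_repr)
    ]


# Hypothetical toy vectors standing in for model hidden states.
fused = combine_representations([1.0, 0.0], [0.0, 1.0], text_weight=0.25)
print(fused)
```

In this toy setup a larger `text_weight` would lean on the text-derived (task-solving) representation, and a smaller one on the image-derived (visual perception) representation.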

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques