Instruction-Following Evaluation of Large Vision-Language Models
Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka, Jun Suzuki

TL;DR
This paper investigates why large vision-language models lose instruction-following ability after fine-tuning and shows that explicitly including output format instructions in the training data helps preserve that ability.
Contribution
It provides a quantitative analysis of the decline in instruction-following ability post-fine-tuning and demonstrates the effectiveness of including output format instructions during training.
Findings
LVLMs' instruction-following ability declines after fine-tuning.
Including output format instructions improves instruction-following accuracy.
Explicit format instructions during training mitigate the decline in instruction-following ability.
Abstract
Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation…
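To make the idea of "specifying the output format" concrete, here is a minimal sketch of how a training prompt could be built with or without an explicit format instruction, and how format adherence could be scored. It is an illustration under stated assumptions, not the authors' pipeline: the instruction wording, the multiple-choice setting, and the names `FORMAT_INSTRUCTION`, `build_prompt`, `follows_format`, and `format_following_accuracy` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Sequence

# Hypothetical format instruction; the paper's actual wording is not given here.
FORMAT_INSTRUCTION = "Answer with a single option letter (A, B, C, or D)."


@dataclass
class Example:
    question: str
    answer: str


def build_prompt(example: Example, with_format: bool) -> str:
    """Return the prompt, optionally ending with an explicit output format instruction."""
    if with_format:
        return f"{example.question}\n{FORMAT_INSTRUCTION}"
    return example.question


def follows_format(output: str) -> bool:
    """True if the output is exactly one uppercase option letter (whitespace ignored)."""
    return re.fullmatch(r"[ABCD]", output.strip()) is not None


def format_following_accuracy(outputs: Sequence[str]) -> float:
    """Fraction of outputs that obey the format; 0.0 for an empty input."""
    if not outputs:
        return 0.0
    return sum(follows_format(o) for o in outputs) / len(outputs)


if __name__ == "__main__":
    ex = Example(question="Which animal is shown? (A) cat (B) dog (C) bird (D) fish", answer="B")
    print(build_prompt(ex, with_format=True))
    # Two of these four hypothetical model outputs match the requested format.
    print(format_following_accuracy(["B", "The answer is B.", " C ", "b"]))
```

A check like this measures only whether the requested format was obeyed, not whether the answer is correct, so the two scores should be reported separately.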
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Language and cultural evolution
