Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park

TL;DR
This paper introduces an instruction-free fine-tuning method for large vision language models in medical imaging, using image-description pairs and a momentum proxy instruction to improve domain-specific task performance.
Contribution
It proposes a novel instruction-free tuning approach with a momentum proxy instruction and response shuffling, enabling effective medical domain adaptation without handcrafted instructions.
Findings
Achieves state-of-the-art accuracy on multiple medical visual question answering datasets.
Improves the efficiency of fine-tuning medical vision language models.
Shows that instruction-free tuning is effective in specialized domains.
Abstract
Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned…
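
The abstract names a momentum proxy instruction, and the contribution summary names response shuffling, but this page does not say how either is computed. As a rough illustration only, the sketch below assumes that "momentum" means an exponential moving average (EMA) over an instruction embedding and that "response shuffling" means permuting the sentences of an image description. The function names, the `momentum` default, and the list-of-floats representation are all hypothetical and may differ from the authors' method.

```python
import random


def ema_update(proxy, learned, momentum=0.99):
    """Move a slowly changing proxy vector toward a learned vector.

    Assumption: the proxy instruction is a vector updated by an
    exponential moving average. The paper may define it differently.
    """
    return [momentum * p + (1.0 - momentum) * q for p, q in zip(proxy, learned)]


def shuffle_response(description, rng):
    """Return the description with its sentences in random order.

    Assumption: "response shuffling" permutes sentences. This is a
    guess at the idea, not the authors' procedure.
    """
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


if __name__ == "__main__":
    proxy = [0.0, 1.0]       # hypothetical proxy instruction embedding
    learned = [1.0, 1.0]     # hypothetical embedding after a training step
    print(ema_update(proxy, learned, momentum=0.5))

    rng = random.Random(0)   # seeded so the shuffle is reproducible
    print(shuffle_response("Heart size is normal. Lungs are clear.", rng))
```

With a momentum close to 1 the proxy changes slowly, which matches the stated goal of preserving the pre-trained model's instruction-following behavior while still adapting to the medical data.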
