Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

Myeongkyun Kang; Soopil Kim; Xiaoxiao Li; Sang Hyun Park

arXiv:2603.19482 · cs.CV · April 28, 2026

TL;DR

This paper introduces an instruction-free fine-tuning method for large vision language models in medical imaging, using image-description pairs and a momentum proxy instruction to improve domain-specific task performance.

Contribution

It proposes a novel instruction-free tuning approach with a momentum proxy instruction and response shuffling, enabling effective medical domain adaptation without handcrafted instructions.

Findings

01 Achieved state-of-the-art accuracy on multiple medical visual question answering datasets.

02 Enhanced fine-tuning efficiency for medical vision language models.

03 Demonstrated effectiveness of instruction-free tuning in specialized domains.

Abstract

Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned…
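
The abstract does not say how the momentum proxy instruction is computed. One common reading of "momentum" in this setting is an exponential moving average (EMA), so the sketch below illustrates only that generic idea: a proxy vector that drifts slowly toward a trainable instruction embedding. The function name `ema_update`, the vector representation, and the momentum value are assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def ema_update(proxy: List[float], current: List[float], momentum: float = 0.99) -> List[float]:
    """Generic EMA step: move `proxy` a small fraction of the way toward `current`.

    Assumption: the proxy instruction is a plain embedding vector. The paper's
    actual update rule may differ from this.
    """
    if len(proxy) != len(current):
        raise ValueError("proxy and current must have the same length")
    return [momentum * p + (1.0 - momentum) * c for p, c in zip(proxy, current)]


# Hypothetical usage: the proxy starts at zero and tracks a fixed target embedding.
proxy = [0.0, 0.0, 0.0]
target = [1.0, -2.0, 0.5]
for _ in range(500):
    proxy = ema_update(proxy, target, momentum=0.99)
```

A high momentum value keeps the proxy close to its previous state, which is the usual reason for choosing an EMA when a slowly changing reference is wanted. Whether the authors use it for this purpose is not stated in the truncated abstract.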
