Diffusion Instruction Tuning

Chen Jin; Ryutaro Tanno; Amrutha Saseendran; Tom Diethe; Philip Teare

arXiv:2502.06814 · cs.LG · May 27, 2025

TL;DR

Lavender is a supervised fine-tuning method that enhances vision-language models by aligning their text-vision attention with that of image generators such as Stable Diffusion. It yields gains of up to 30% using only 0.13 million training examples and about one day of fine-tuning on 8 GPUs.

Contribution

It introduces Lavender, a novel alignment-based fine-tuning approach that leverages image generation models to improve vision-language understanding efficiently.

Findings

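01  Up to 30% performance gains on state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5).

02  68% boost on challenging out-of-distribution medical QA tasks.

03  Requires only 0.13 million training examples, 2.5% of typical large-scale SFT datasets.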

Abstract

We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks.…
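
The alignment idea in the abstract can be illustrated with a small sketch. It assumes, for illustration only, that the alignment term is a mean squared error between normalized per-patch attention maps, added to the usual SFT loss with a scalar weight. The function names, the normalization step, and the `weight` parameter are hypothetical and not taken from the paper's code.

```python
def normalize(attn):
    """Scale a flat list of non-negative attention weights so they sum to 1."""
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention map must have a positive sum")
    return [a / total for a in attn]


def alignment_loss(vlm_attn, sd_attn):
    """Mean squared error between two normalized attention maps.

    Both inputs are flat lists over the same image patches for one text token:
    vlm_attn from the VLM transformer, sd_attn from Stable Diffusion
    (assumed to be precomputed and treated as a fixed target).
    """
    if len(vlm_attn) != len(sd_attn):
        raise ValueError("attention maps must cover the same patches")
    p = normalize(vlm_attn)
    q = normalize(sd_attn)
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def total_loss(sft_loss, vlm_attn, sd_attn, weight=1.0):
    """Combine the standard SFT loss with the weighted alignment term (assumed form)."""
    return sft_loss + weight * alignment_loss(vlm_attn, sd_attn)


if __name__ == "__main__":
    # Toy 2x2 image flattened to 4 patches, attention for a single text token.
    vlm = [0.4, 0.3, 0.2, 0.1]
    sd = [0.1, 0.2, 0.3, 0.4]
    print(total_loss(sft_loss=2.0, vlm_attn=vlm, sd_attn=sd, weight=0.5))
```

In this sketch the alignment term vanishes when the two maps agree up to scale, so the weight only controls how strongly the VLM's attention is pulled toward the Stable Diffusion target during fine-tuning.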

Code & Models

Models

lxasqjc/lavender-llama-3.2-11b-lora
model · ♡ 2

Datasets

lxasqjc/flickr1k-sd-attn
dataset · 1.3k downloads

Videos

Diffusion Instruction Tuning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need · Diffusion · Shrink and Fine-Tune