Diffusion Instruction Tuning
Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare

TL;DR
Lavender is a supervised fine-tuning method that improves vision-language models by aligning their text-vision attention with that of image generators such as Stable Diffusion, yielding gains of up to 30% from only 0.13 million training examples and about one day of fine-tuning on 8 GPUs.
Contribution
The paper introduces Lavender, an alignment-based supervised fine-tuning approach that uses the attention of image generation models to improve vision-language understanding with little data and compute.
Findings
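Up to 30% gains on open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5) across in- and out-of-distribution tasks.
68% boost on challenging out-of-distribution medical QA tasks.
Requires only 0.13 million training examples (2.5% of typical large-scale SFT datasets) and fine-tunes on 8 GPUs in a single day.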
Abstract
We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks.…
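The attention alignment described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss, so everything here is an assumption made for illustration: the alignment term is taken to be a mean squared error between the VLM's text-vision attention map and a reference map from the diffusion model, added to the usual SFT loss with a hypothetical weight `align_weight`. The function names and toy numbers are also hypothetical, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_alignment_loss(vlm_attn, diffusion_attn):
    """Mean squared error between two attention maps of equal shape.

    Each map is a list of rows (one per text token); each row is a
    distribution over image patches. MSE is an assumed choice here.
    """
    assert len(vlm_attn) == len(diffusion_attn)
    total, count = 0.0, 0
    for v_row, d_row in zip(vlm_attn, diffusion_attn):
        assert len(v_row) == len(d_row)
        for v, d in zip(v_row, d_row):
            total += (v - d) ** 2
            count += 1
    return total / count


def total_loss(sft_loss, vlm_attn, diffusion_attn, align_weight=1.0):
    """SFT loss plus a weighted alignment term (align_weight is hypothetical)."""
    return sft_loss + align_weight * attention_alignment_loss(vlm_attn, diffusion_attn)


# Toy example: 2 text tokens attending over 3 image patches.
vlm_attn = [softmax([0.2, 0.1, 0.0]), softmax([0.0, 0.0, 0.0])]
diffusion_attn = [softmax([2.0, 0.0, -1.0]), softmax([-1.0, 0.0, 2.0])]

# The placeholder SFT loss of 1.5 stands in for the next-token loss.
loss_before = total_loss(1.5, vlm_attn, diffusion_attn)
# If the VLM attention matched the reference exactly, only the SFT term remains.
loss_aligned = total_loss(1.5, diffusion_attn, diffusion_attn)
print(loss_before, loss_aligned)
```

In this toy setup the alignment term vanishes once the two maps coincide, so minimizing the combined loss pushes the VLM's attention toward the diffusion model's while the SFT term still drives the answer quality.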
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need · Diffusion · Shrink and Fine-Tune
