ILLUME: Rationalizing Vision-Language Models through Human Interactions
Manuel Brack, Patrick Schramowski, Björn Deiseroth, Kristian Kersting

TL;DR
ILLUME is a human-in-the-loop tuning method that improves vision-language models' rationalization abilities by iteratively incorporating human feedback on generated rationales, achieving competitive performance with less data.
Contribution
This paper introduces ILLUME, a novel human interaction-based tuning paradigm that enhances VLMs' rationalization aligned with human intent using minimal feedback.
Findings
ILLUME achieves results competitive with standard supervised fine-tuning while using less training data.
The method effectively aligns model rationales with human preferences.
Minimal human feedback suffices for significant improvements.
Abstract
Bootstrapping from pre-trained language models has been proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, the outputs of these models rarely align with users' rationales for specific answers. In order to improve this alignment and reinforce commonsense reasons, we propose a tuning paradigm based on human interactions with machine-generated data. ILLUME executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides feedback via preference selection, which is then used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities that are aligned with human intent. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised finetuning…
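The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `illume_loop`, `ToyVLM`, and `simulated_critic` are hypothetical names, the "model" is a random toy stand-in rather than a real VLM, the human critic is simulated by a rule, and the iteration and candidate counts are arbitrary.

```python
import random
from typing import Callable, List, Tuple

Prompt = Tuple[str, str, str]   # (image id, question, answer)
Sample = Tuple[Prompt, str]     # (prompt, rationale accepted by the critic)


def illume_loop(
    prompts: List[Prompt],
    sample_rationales: Callable[[Prompt, int], List[str]],
    critic_accepts: Callable[[Prompt, str], bool],
    fine_tune: Callable[[List[Sample]], None],
    iterations: int = 3,       # assumed value, not from the paper
    num_candidates: int = 5,   # assumed value, not from the paper
) -> Tuple[List[Sample], List[int]]:
    """Sample candidate rationales, keep the ones the critic accepts, fine-tune, repeat."""
    training_data: List[Sample] = []
    sizes: List[int] = []
    for _ in range(iterations):
        for prompt in prompts:
            for rationale in sample_rationales(prompt, num_candidates):
                sample = (prompt, rationale)
                # Only critic-approved, not-yet-seen rationales enter the training set.
                if critic_accepts(prompt, rationale) and sample not in training_data:
                    training_data.append(sample)
        fine_tune(training_data)          # tune on everything collected so far
        sizes.append(len(training_data))  # track how the training set grows
    return training_data, sizes


class ToyVLM:
    """Hypothetical stand-in for a VLM: emits 'good' or 'bad' rationale strings."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.p_good = 0.2  # chance that a sampled rationale is acceptable

    def sample(self, prompt: Prompt, k: int) -> List[str]:
        out = []
        for _ in range(k):
            label = "good" if self.rng.random() < self.p_good else "bad"
            out.append(f"{label} rationale #{self.rng.randint(0, 999)}")
        return out

    def fine_tune(self, data: List[Sample]) -> None:
        # Toy assumption: more accepted data makes good rationales more likely.
        self.p_good = min(1.0, 0.2 + 0.1 * len(data))


def simulated_critic(prompt: Prompt, rationale: str) -> bool:
    """Stand-in for the human critic's preference selection."""
    return rationale.startswith("good")


model = ToyVLM(seed=0)
prompts = [("img_001", "What is the person holding?", "an umbrella")]
data, sizes = illume_loop(prompts, model.sample, simulated_critic, model.fine_tune)
print(sizes)  # size of the training set after each iteration
```

Passing the sampler, critic, and tuner as callables keeps the loop itself independent of any particular model or feedback interface; in a real setting the critic would be a person selecting among the sampled rationales.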
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: ALIGN
