Visual Instruction Tuning with Polite Flamingo
Delong Chen, Jianfeng Liu, Wenliang Dai, Baoyuan Wang

TL;DR
This paper introduces Polite Flamingo, a response rewriter that improves the politeness and formatting of vision-language model annotations, enhancing model responses and human preference.
Contribution
It proposes Polite Flamingo for response rewriting, creates the PF-1M dataset, and develops Clever Flamingo with novel tuning methods to improve multi-modal understanding and response politeness.
Findings
Clever Flamingo outperforms baselines in understanding tasks.
Polite Flamingo improves response politeness and formatting.
PF-1M dataset enhances training quality.
Abstract
Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large Language Models (LLMs) using an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet, during this process, a side effect, which we termed as the "multi-modal alignment tax", surfaces. This side effect negatively impacts the model's ability to format responses appropriately -- for instance, its "politeness" -- due to the overly succinct and unformatted nature of raw annotations, resulting in reduced human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
