MACAROON: Training Vision-Language Models To Be Your Engaged Partners
Shujin Wu, Yi R. Fung, Sha Li, Yixin Wan, Kai-Wei Chang, Heng Ji

TL;DR
This paper introduces MACAROON, a training method that enhances vision-language models to proactively engage with users by asking clarifying questions, significantly improving their engagement capabilities without sacrificing general performance.
Contribution
The study develops a hierarchical question framework, creates the PIE evaluation dataset, and proposes MACAROON, a novel training approach that boosts LVLMs' proactive engagement abilities.
Findings
Existing LVLMs perform poorly at proactive engagement: the best-performing open-weights model reaches an Aggregate Align Rate (AAR) of only 0.28.
MACAROON improves engagement performance to 0.84 AAR.
The method maintains comparable general task performance.
Abstract
Large vision-language models (LVLMs), while proficient in following instructions and responding to diverse questions, invariably generate detailed responses even when questions are ambiguous or unanswerable, leading to hallucinations and bias issues. Thus, it is essential for LVLMs to proactively engage with humans to ask for clarifications or additional information for better responses. In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners. We begin by establishing a three-tiered hierarchy of invalid, ambiguous, and personalizable questions to measure the proactive engagement capabilities of LVLMs. Utilizing this hierarchy, we create PIE (ProactIve Engagement Evaluation) through GPT-4o and human annotators, consisting of 853 questions across six distinct, fine-grained question types that are verified by human annotators and…
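The three-tiered hierarchy named in the abstract could be represented in code roughly as follows. This is only a minimal sketch: the class names, the record fields, and the example questions are hypothetical illustrations, not the paper's implementation or the actual PIE schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class QuestionTier(Enum):
    """The three tiers named in the abstract."""
    INVALID = "invalid"
    AMBIGUOUS = "ambiguous"
    PERSONALIZABLE = "personalizable"


@dataclass(frozen=True)
class EngagementQuestion:
    # Hypothetical record layout; the real PIE schema is not given in this summary.
    image_id: str
    text: str
    tier: QuestionTier


def count_by_tier(questions):
    """Return how many questions fall into each tier (zero if none)."""
    counts = Counter(q.tier for q in questions)
    return {tier: counts.get(tier, 0) for tier in QuestionTier}


# Made-up examples for illustration only; they are not drawn from PIE.
examples = [
    EngagementQuestion("img-001", "What colour is the hat on the dog?", QuestionTier.INVALID),
    EngagementQuestion("img-002", "What is he holding?", QuestionTier.AMBIGUOUS),
    EngagementQuestion("img-003", "Which dish here should I order?", QuestionTier.PERSONALIZABLE),
    EngagementQuestion("img-004", "Which one is bigger?", QuestionTier.AMBIGUOUS),
]

tier_counts = count_by_tier(examples)
```

Tagging each question with its tier in this way would let an evaluation report engagement scores per tier as well as overall.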
Taxonomy
Topics: Organizational Strategy and Culture
Methods: ALIGN
