FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Pengxiang Li, Zhi Gao, Bofei Zhang, Tao Yuan, Yuwei Wu, Mehrtash Harandi, Yunde Jia, Song-Chun Zhu, Qing Li

TL;DR
This paper introduces FIRE, a large dataset for training and evaluating vision-language models' ability to refine responses based on user feedback, along with a benchmark (FIRE-Bench) and a fine-tuned model (FIRE-LLaVA) that outperforms untrained VLMs by 50% on feedback refinement.
Contribution
The paper presents FIRE, a novel feedback-refinement dataset and benchmark for VLMs, and introduces FIRE-LLaVA, a model that excels in feedback-based response refinement.
Findings
The FIRE dataset contains 1.1 million multi-turn conversations derived from 27 source datasets.
FIRE-LLaVA outperforms untrained VLMs by 50% on feedback refinement tasks.
FIRE enables more efficient and accurate user-agent interactions.
Abstract
Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset, consisting of 1.1M multi-turn conversations that are derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V, and FIRE-1M is freely generated via models trained on FIRE-100K. Then, we build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as the test data, two evaluation settings, and a model to provide feedback for VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and…
Taxonomy
Topics: Speech and dialogue systems
