FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

Pengxiang Li; Zhi Gao; Bofei Zhang; Tao Yuan; Yuwei Wu; Mehrtash Harandi; Yunde Jia; Song-Chun Zhu; Qing Li

arXiv:2407.11522 · cs.CV · December 3, 2024

TL;DR

This paper introduces FIRE, a large dataset for training and evaluating vision-language models' ability to refine responses based on user feedback, along with a benchmark and a fine-tuned model demonstrating significant improvements.

Contribution

The paper presents FIRE, a novel feedback-refinement dataset and benchmark for VLMs, and introduces FIRE-LLaVA, a model that excels in feedback-based response refinement.

Findings

1. The FIRE dataset contains 1.1 million multi-turn conversations.

2. FIRE-LLaVA outperforms VLMs not trained on FIRE by 50% on feedback-refinement tasks.

3. FIRE enables more efficient and accurate user-agent interactions.

Abstract

Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset, consisting of 1.1M multi-turn conversations that are derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V, and FIRE-1M is freely generated via models trained on FIRE-100K. Then, we build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as the test data, two evaluation settings, and a model to provide feedback for VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and…
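The abstract describes multi-turn feedback-refinement conversations and a benchmark that pairs a VLM with a model that provides feedback. The following is a minimal, hypothetical sketch of such a loop, not FIRE-Bench's actual data format, evaluation settings, or metrics. A "student" answers, a "teacher" critiques wrong answers, and the dialogue ends when the answer is accepted or a turn limit is reached. Several names are illustrative assumptions: `Turn`, `Conversation`, `refine_dialogue`, `summarize`, `max_turns`, and the two summary statistics (average turns, share of first-turn errors later fixed). Real VLM and feedback-model calls are replaced by toy string functions, and correctness is assumed to be an exact string match.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces: a "student" maps (question, feedback history) -> answer,
# and a "teacher" maps (question, answer, ground truth) -> feedback text.
Student = Callable[[str, List[str]], str]
Teacher = Callable[[str, str, str], str]


@dataclass
class Turn:
    response: str            # the model's answer at this turn
    feedback: Optional[str]  # feedback on that answer; None means it was accepted


@dataclass
class Conversation:
    question: str
    ground_truth: str
    turns: List[Turn] = field(default_factory=list)


def refine_dialogue(question: str, ground_truth: str, student: Student,
                    teacher: Teacher, max_turns: int = 3) -> Conversation:
    """Alternate answers and feedback until the answer matches or turns run out."""
    conv = Conversation(question, ground_truth)
    history: List[str] = []
    for _ in range(max_turns):
        answer = student(question, history)
        if answer == ground_truth:  # assumption: exact-match correctness
            conv.turns.append(Turn(answer, None))
            break
        feedback = teacher(question, answer, ground_truth)
        conv.turns.append(Turn(answer, feedback))
        history.append(feedback)
    return conv


def summarize(convs: List[Conversation]) -> Tuple[float, float]:
    """Return (average turns used, share of first-turn errors later fixed)."""
    if not convs:
        return 0.0, 0.0
    avg_turns = sum(len(c.turns) for c in convs) / len(convs)
    wrong_first = [c for c in convs if c.turns and c.turns[0].feedback is not None]
    fixed = [c for c in wrong_first if c.turns[-1].feedback is None]
    rate = len(fixed) / len(wrong_first) if wrong_first else 0.0
    return avg_turns, rate


# Toy stand-ins for a VLM and a feedback model (purely illustrative data).
CANDIDATES: Dict[str, List[str]] = {
    "How many apples?": ["3", "4"],
    "What animal is shown?": ["cat"],
    "What color is the car?": ["red", "blue", "green"],
}
ANSWERS: Dict[str, str] = {
    "How many apples?": "4",
    "What animal is shown?": "cat",
    "What color is the car?": "yellow",
}


def toy_student(question: str, history: List[str]) -> str:
    # Moves to the next candidate answer after each piece of feedback received.
    cands = CANDIDATES[question]
    return cands[min(len(history), len(cands) - 1)]


def toy_teacher(question: str, answer: str, ground_truth: str) -> str:
    # Generic corrective feedback that does not reveal the ground truth.
    return f"'{answer}' is not correct; please look at the image again."


convs = [refine_dialogue(q, a, toy_student, toy_teacher) for q, a in ANSWERS.items()]
summary = summarize(convs)
print([len(c.turns) for c in convs], summary)  # prints: [2, 1, 3] (2.0, 0.5)
```

In this toy run, the first question is fixed after one round of feedback and the second is right immediately. The third never converges within three turns. Exact-match checking keeps the sketch self-contained; a real benchmark for open-ended VLM answers would likely need a more lenient judge.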

Code & Models

Datasets

PengxiangLi/FIRE

Videos

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models (SlidesLive)

Taxonomy

Topics: Speech and dialogue systems