On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets

Ning Liao; Shaofeng Zhang; Renqiu Xia; Min Cao; Yu Qiao; Junchi Yan

arXiv:2310.06594 · cs.CV · January 2, 2024

TL;DR

This paper proposes a new evaluation paradigm for vision-language instruction tuning datasets, introduces metrics for dataset and sample quality, and constructs a comprehensive dataset REVO-LION that enhances model performance and benchmarking.

Contribution

It introduces a tune-cross-evaluation paradigm, defines new metrics for dataset and sample quality, and creates REVO-LION, a high-quality dataset for improving and benchmarking VLIT models.

Findings

01. The proposed evaluation paradigm is validated through extensive experiments.

02. REVO-LION enables training models with comparable performance using only half the data.

03. REVO-LION serves as both a training resource and a benchmark for future research.

Abstract

There is an emerging line of research on multimodal instruction tuning, and a number of benchmarks have recently been proposed for evaluating these models. Instead of evaluating the models directly, in this paper we evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves. We also seek a way of building a dataset for developing an all-powerful VLIT model, which we believe could also be useful for establishing a grounded protocol for benchmarking VLIT models. Because effective evaluation of VLIT datasets remains an open question, we propose a tune-cross-evaluation paradigm: tuning on one dataset and evaluating on each of the others in turn. For each single tune-evaluation experiment set, we define the Meta Quality (MQ) as the mean score obtained by a set of caption metrics, including BLEU, METEOR, and ROUGE-L, to quantify the quality of a certain dataset or a sample. On this…
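The two ideas the abstract spells out, the tune-cross-evaluation schedule and MQ as a mean over caption metrics, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the dataset names, and the metric values are hypothetical, and the abstract says the metric set "includes" BLEU, METEOR, and ROUGE-L, so the real set may be larger. The metric scores are assumed to be computed elsewhere.

```python
from itertools import permutations
from statistics import mean


def tune_cross_pairs(dataset_names):
    """Return every ordered (tune_on, evaluate_on) pair of distinct datasets.

    Mirrors the described paradigm: tune on one dataset, then evaluate
    on each of the others in turn.
    """
    return list(permutations(dataset_names, 2))


def meta_quality(metric_scores):
    """Meta Quality (MQ) for one tune-evaluation run: the mean of the
    caption-metric scores passed in (assumed precomputed)."""
    if not metric_scores:
        raise ValueError("at least one metric score is required")
    return mean(metric_scores.values())


# Hypothetical dataset names, used only to show the schedule.
datasets = ["dataset_a", "dataset_b", "dataset_c"]
pairs = tune_cross_pairs(datasets)

# Made-up scores for a single (tune_on, evaluate_on) run; not results
# from the paper.
scores = {"BLEU": 0.30, "METEOR": 0.25, "ROUGE-L": 0.35}
mq = meta_quality(scores)

for tune_on, evaluate_on in pairs:
    print(f"tune on {tune_on}, evaluate on {evaluate_on}")
print(f"MQ for the example run: {mq:.3f}")
```

With N datasets the schedule contains N × (N − 1) runs, one MQ value per run. How those values are aggregated into the dataset-level and sample-level quality metrics is not covered by the visible part of the abstract, so it is left out here.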

Taxonomy

Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Natural Language Processing Techniques