On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets
Ning Liao, Shaofeng Zhang, Renqiu Xia, Min Cao, Yu Qiao, Junchi Yan

TL;DR
This paper proposes a new evaluation paradigm for vision-language instruction tuning datasets, introduces metrics for dataset and sample quality, and constructs a comprehensive dataset REVO-LION that enhances model performance and benchmarking.
Contribution
It introduces a tune-cross-evaluation paradigm, defines new metrics for dataset and sample quality, and creates REVO-LION, a high-quality dataset for improving and benchmarking VLIT models.
Findings
The proposed evaluation paradigm is validated through extensive experiments.
REVO-LION enables training models with comparable performance using only half the data.
REVO-LION serves as both a training resource and a benchmark for future research.
Abstract
Multimodal instruction tuning is an emerging line of research, and a number of benchmarks have recently been proposed for evaluating the resulting models. Instead of evaluating the models directly, in this paper we evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves. We also seek a way to build a dataset for developing an all-powerful VLIT model, which we believe could also be useful for establishing a grounded protocol for benchmarking VLIT models. Because effective evaluation of VLIT datasets remains an open question, we propose a tune-cross-evaluation paradigm: tuning on one dataset and evaluating on the others in turn. For each single tune-evaluation experiment set, we define the Meta Quality (MQ) as the mean score obtained by a set of caption metrics, including BLEU, METEOR, and ROUGE-L, to quantify the quality of a certain dataset or a sample. On this…
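The tune-cross-evaluation loop and the MQ score described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `tune` and `evaluate` are hypothetical callables standing in for the actual fine-tuning and caption-metric scoring, the metric values are assumed to share a common scale, and MQ is taken as a plain unweighted mean over whatever metrics `evaluate` returns.

```python
from statistics import mean
from typing import Any, Callable, Dict, List


def meta_quality(metric_scores: Dict[str, float]) -> float:
    """Unweighted mean of caption-metric scores (e.g. BLEU, METEOR, ROUGE-L).

    Assumption: all scores are on the same scale, so a plain mean is meaningful.
    """
    return mean(metric_scores.values())


def tune_cross_evaluate(
    datasets: List[str],
    tune: Callable[[str], Any],                         # hypothetical: dataset name -> tuned model
    evaluate: Callable[[Any, str], Dict[str, float]],   # hypothetical: (model, dataset) -> metric scores
) -> Dict[str, Dict[str, float]]:
    """Tune on each dataset in turn and score the tuned model on every other dataset."""
    results: Dict[str, Dict[str, float]] = {}
    for train_name in datasets:
        model = tune(train_name)
        results[train_name] = {
            held_out: meta_quality(evaluate(model, held_out))
            for held_out in datasets
            if held_out != train_name  # evaluate only on the *other* datasets
        }
    return results
```

The result is a nested mapping from each tuning dataset to its MQ on every other dataset. How these per-pair scores are aggregated further is not covered by the truncated abstract, so the sketch stops there.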
Taxonomy
Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Natural Language Processing Techniques
