Less is More: High-value Data Selection for Visual Instruction Tuning

Zikang Liu; Kun Zhou; Wayne Xin Zhao; Dawei Gao; Yaliang Li; Ji-Rong Wen

arXiv:2403.09559·cs.CL·October 11, 2024·1 cites

Open Access·3 Reviews

TL;DR

This paper introduces TIVE, a data selection method that reduces visual instruction data by 85% based on influence and difficulty scores, maintaining or improving model performance across multiple benchmarks.

Contribution

The paper presents a novel high-value data selection approach for visual instruction tuning that significantly reduces training data and cost while preserving or enhancing model performance.

Findings

01

Using only 15% of data achieves comparable performance to full datasets.

02

Redundant data within visual instruction datasets can be effectively eliminated.

03

TIVE outperforms full-data models on several benchmarks.

Abstract

Visual instruction tuning is the key to building large vision language models (LVLMs), which can greatly improve task generalization and solving capabilities by learning a mixture of instruction data from diverse visual tasks. Previous work mostly collects multiple existing visual instruction datasets heuristically for training (sometimes more than a million instructions), which may introduce data redundancy and increase the training cost. To investigate this issue, we conduct a series of empirical studies, which reveal significant redundancy within visual instruction datasets and show that greatly reducing the number of instructions from several tasks does not even affect performance. Based on these findings, we propose a high-value data selection approach, TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. In TIVE, we first estimate…
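The abstract is cut off on this page before it defines how TIVE scores data, so the following is only a minimal sketch of the general idea of score-based selection, not the authors' formulation. It assumes each sample already has a gradient feature vector (the reviews below mention LoRA gradients from a reference model), and it uses a hypothetical influence proxy: cosine similarity between a sample's gradient and the pool's mean gradient. The names `influence_scores` and `select_top_fraction` are illustrative, and the difficulty score mentioned in the TL;DR is not modeled here.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def influence_scores(grads):
    """Hypothetical proxy (an assumption, not TIVE's definition):
    similarity of each sample's gradient to the mean gradient of the pool."""
    dim = len(grads[0])
    mean = [sum(g[i] for g in grads) / len(grads) for i in range(dim)]
    return [cosine(g, mean) for g in grads]


def select_top_fraction(scores, fraction):
    """Return indices of the highest-scoring samples, keeping `fraction` of the pool."""
    k = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy pool: random vectors stand in for per-sample gradient features.
random.seed(0)
grads = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
scores = influence_scores(grads)
kept = select_top_fraction(scores, 0.15)  # keep 15% of the pool, as in the TL;DR
```

In practice the scoring function is the expensive part, since it needs a gradient for every sample in the pool; the selection step itself is a sort.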

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 6·Confidence 4

Strengths

1. The experiments are complete across various MLLM backbones, including Vicuna, Phi, and LLaMA3, and architectures, including LLaVA-1.5, SVIT-Mix, and Mini-Gemini. The authors also show comparisons with baselines and advanced MLLMs. 2. The performance matches the full-data baselines with only 10% to 30% of the training data, which shows the effectiveness of TIVE. 3. The paper is well written and the formulation is clear.

Weaknesses

1. The main weakness lies in the design of the approach, especially its computation cost. As I understand it, the gradient-based inference and the other selection operations are costly, possibly even matching the original training cost. This weakens the contribution of the pruning method. 2. The selection based on gradients is a posterior probability, which means choosing the hard samples as prior knowledge. This may make the comparisons against baselines unfair. 3. The overall

Reviewer 02·Rating 6·Confidence 5

Strengths

The observation of the dataset redundancy problem aligns with the community's observations. The proposed TIVE method sounds reasonable. The authors conduct extensive experiments with detailed analysis to demonstrate the effectiveness of the method and its components.

Weaknesses

1. Though the experiments are comprehensive, there are several points to further discuss or clarify in the method. See questions. 2. The authors need to provide a further discussion on the overall cost of the method: as TIVE needs the reference model trained with warmup data, the selection of TIVE is generally model-specific. TIVE needs to compute the LoRA gradient over all samples in the pool, so this cost is close to training on all of the data with LoRA. Tuning the hyper-parameters of TIVE

Reviewer 03·Rating 5·Confidence 3

Strengths

1. The paper introduces a well-justified and innovative method, TIVE, that addresses data redundancy in visual instruction datasets for LVLMs. 2. The motivation for addressing redundancy is well explained, and the proposed solution is logically developed based on detailed empirical findings. 3. The authors provide thorough empirical evidence demonstrating the existence of redundancy within current visual instruction datasets, supporting the motivation for their approach.

Weaknesses

1. The paper does not sufficiently discuss the potential limitations of the TIVE approach, such as its scalability to even larger datasets or its applicability to different types of multimodal tasks. 2. I have some concerns regarding the data selection approach. In the earlier stages of machine learning, data and feature selection were widely popular. However, recent trends show that using larger models with bigger datasets tends to yield remarkable generalization capabilities. I hope the author

Taxonomy

Topics: Data Visualization and Analytics