Less is More: High-value Data Selection for Visual Instruction Tuning
Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen

TL;DR
This paper introduces TIVE, a data selection method that reduces visual instruction data by 85% based on influence and difficulty scores, maintaining or improving model performance across multiple benchmarks.
Contribution
The paper presents a novel high-value data selection approach for visual instruction tuning that significantly reduces training data and cost while preserving or enhancing model performance.
Findings
Training on only 15% of the data achieves performance comparable to training on the full dataset.
Redundant data within visual instruction datasets can be effectively eliminated.
TIVE outperforms full-data models on several benchmarks.
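The budgeted selection step behind these findings can be illustrated with a minimal sketch. It assumes every sample already has a single scalar value score and simply keeps the top 15% of the pool. The function name `select_top_fraction` and the random scores are hypothetical, and the code does not reproduce how TIVE actually computes or combines its influence and difficulty scores.

```python
import random


def select_top_fraction(scores, fraction=0.15):
    """Return sorted indices of the highest-scoring samples, keeping `fraction` of the pool."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Number of samples to keep, with at least one kept for a non-empty pool.
    keep = max(1, round(fraction * len(scores)))
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical per-sample value scores (assumption: random placeholders).
# A real pipeline would derive these from the model being tuned.
random.seed(0)
pool_scores = [random.random() for _ in range(1000)]

selected = select_top_fraction(pool_scores, fraction=0.15)
print(len(selected), "of", len(pool_scores), "samples kept")
```

A plain top-k cut like this ignores balance across tasks, so it should be read only as an illustration of the data budget, not of the selection criterion itself.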
Abstract
Visual instruction tuning is the key to building large vision language models (LVLMs), which can greatly improve task generalization and solving capabilities by learning from a mixture of instruction data from diverse visual tasks. Previous work mostly collects multiple existing visual instruction datasets heuristically for training (sometimes more than a million instructions), which may introduce data redundancy and increase the training cost. To investigate this issue, we conduct a series of empirical studies, which reveal significant redundancy within visual instruction datasets and show that greatly reducing the number of instructions from several tasks does not even affect performance. Based on these findings, we propose a high-value data selection approach, TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. In TIVE, we first estimate…
Peer Reviews
Decision: Submitted to ICLR 2025
1. The experiments are complete across various MLLM backbones, including Vicuna, Phi, and LLaMA3, and architectures, including LLaVA-1.5, SVIT-Mix, and Mini-Gemini. The authors also show comparisons with baselines and advanced MLLMs. 2. The performance matches the full-data baselines with only 10% to 30% of the training data, which shows the effectiveness of TIVE. 3. The paper is well written and the formulations are clear.
1. The main weakness lies in the design of the approach, especially regarding the computation costs. In my understanding, the gradient-based inference and the other selection operations are costly, possibly even matching the original training cost. This weakens the contribution of the pruning method. 2. The selection based on gradients is a posterior probability, which means choosing the hard samples as prior knowledge. This may make the comparisons against baselines unfair. 3. The overall
The observation of the dataset redundancy problem aligns with the community's observations. The proposed TIVE method sounds reasonable. The authors conduct extensive experiments with detailed analysis to demonstrate the effectiveness of the method and its components.
1. Though the experiments are comprehensive, there are several points to further discuss or clarify in the method. See questions. 2. The authors need to provide a further discussion on the overall cost of the method: as TIVE needs a reference model trained with warmup data, the selection of TIVE is generally model-specific. TIVE needs to compute the LoRA gradient over all samples in the pool, so this cost is close to training on all of the data with LoRA. Tuning the hyper-parameters of TIVE
1. The paper introduces a well-justified and innovative method, TIVE, that addresses data redundancy in visual instruction datasets for LVLMs. 2. The motivation for addressing redundancy is well explained, and the proposed solution is logically developed based on detailed empirical findings. 3. The authors provide thorough empirical evidence demonstrating the existence of redundancy within current visual instruction datasets, supporting the motivation for their approach.
1. The paper does not sufficiently discuss the potential limitations of the TIVE approach, such as its scalability to even larger datasets or its applicability to different types of multimodal tasks. 2. I have some concerns regarding the data selection approach. In the earlier stages of machine learning, data and feature selection were widely popular. However, recent trends show that using larger models with bigger datasets tends to yield remarkable generalization capabilities. I hope the author
Taxonomy
Topics: Data Visualization and Analytics
