Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning
Rujie Wu, Haozhe Zhao, Hai Ci, Yizhou Wang

TL;DR
The paper introduces Goal-Driven Data Optimization (GDO), a framework that selects more effective training samples for multimodal instruction tuning, leading to faster convergence and higher accuracy with less data.
Contribution
GDO is a novel data selection method that constructs optimized training subsets tailored to specific goals, improving efficiency in multimodal instruction tuning.
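As a rough illustration of goal-dependent subset construction, the sketch below ranks candidates by a weighted sum of six per-sample descriptors and keeps the top-scoring ones. This summary does not say what the six descriptors are or how GDO combines them, so the linear scoring rule, the function name `select_subset`, the weight vector, and all numbers are hypothetical stand-ins rather than the authors' method.

```python
def select_subset(descriptors, goal_weights, budget):
    """Return indices of the `budget` highest-scoring candidates.

    descriptors:  list of rows, each holding six descriptor values for one sample
                  (hypothetical values; the real descriptors are not given here).
    goal_weights: six weights saying how much a goal values each descriptor
                  (hypothetical; a linear score is an assumption of this sketch).
    """
    scores = [sum(w * d for w, d in zip(goal_weights, row)) for row in descriptors]
    # Sort candidate indices by score, highest first, then keep the top `budget`.
    ranked = sorted(range(len(descriptors)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Three made-up candidates with six made-up descriptor values each.
candidates = [
    [0.9, 0.1, 0.5, 0.2, 0.7, 0.3],
    [0.2, 0.8, 0.4, 0.9, 0.1, 0.6],
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
]

# A made-up goal that only values descriptors 1 and 3.
example_goal = [0.0, 1.0, 0.0, 1.0, 0.0, 0.0]

chosen = select_subset(candidates, example_goal, budget=2)
print(chosen)
```

Changing the weight vector changes which samples are kept, which is the sense in which a subset can be tailored to a goal under this simplified scheme.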
Findings
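GDO reaches the Uni-10x reference accuracy with over 90% fewer training samples than the fixed 512k-sample baseline (26.6k to 35.4k samples, depending on the benchmark).
GDO converges faster and achieves higher accuracy on MVBench, VideoMME, MLVU, and LVBench.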
Temporal emphasis in data selection improves long-video understanding.
Abstract
Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more…
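The sample counts in the abstract can be turned into per-benchmark savings with a few lines of arithmetic. The sketch below assumes that "512k" means 512,000 samples and that "35.4k" and similar figures are rounded values; the names `BASELINE_SAMPLES`, `RESULTS`, and `sample_reduction` are introduced here for illustration only.

```python
# Fixed Uni-10x baseline size, assuming "512k" means 512,000 samples.
BASELINE_SAMPLES = 512_000

# Per benchmark: (samples GDO needed to reach the Uni-10x reference,
#                 accuracy gain in percentage points), as quoted in the abstract.
RESULTS = {
    "MVBench": (35_400, 1.38),
    "VideoMME": (26_600, 1.67),
    "MLVU": (27_300, 3.08),
    "LVBench": (34_700, 0.84),
}


def sample_reduction(samples_used, baseline=BASELINE_SAMPLES):
    """Return the percentage of baseline samples that were not needed."""
    return 100.0 * (1.0 - samples_used / baseline)


reductions = {name: sample_reduction(n) for name, (n, _) in RESULTS.items()}

for name, (n, gain) in RESULTS.items():
    print(f"{name}: {n:,} samples, {reductions[name]:.1f}% fewer than baseline, "
          f"+{gain:.2f} points")
```

Under these assumptions every benchmark reaches the reference with roughly 5% to 7% of the baseline sample count.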
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition
