Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models
Andrew Bai, Justin Cui, Ruochen Wang, Cho-Jui Hsieh

TL;DR
This paper proposes a targeted instruction data selection method for vision-language models. It improves benchmark performance by first determining whether a benchmark mainly benefits from instructions with similar visual concepts or similar skills, then selecting the training instructions that match best.
Contribution
The paper introduces a simple method that selects training instructions based on benchmark-specific concepts or skills, and validates it on 10+ benchmarks.
Findings
+0.9% over the best existing baseline, averaged over all benchmarks
+1.5% over the best existing baseline on the skill-focused subset of benchmarks
Highlights the importance of balancing concepts and skills in training data
Abstract
Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we find that vision-language benchmarks fall into a dichotomy: each benefits mainly from training on instructions with either similar skills or similar visual concepts. Inspired by this discovery, we design a simple targeted training data selection method to optimize performance on a given benchmark. We first extract the concepts/skills from the benchmark, then determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select the instructions with the most closely matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of…
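The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes every instruction and benchmark sample has already been mapped to a "concept" vector and a "skill" vector (how the paper extracts these is not stated here), that the concept-versus-skill decision is supplied by the caller as `mode`, and that matching is scored by cosine similarity. The function names, the data layout, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_instructions(pool, benchmark_features, mode, k):
    """Return the ids of the k pool instructions that best match the benchmark.

    pool: list of dicts with keys "id", "concept", "skill" (hypothetical layout).
    benchmark_features: dict mapping "concept" / "skill" to a list of vectors.
    mode: "concept" or "skill"; assumed to be decided beforehand for the benchmark.
    """
    if mode not in ("concept", "skill"):
        raise ValueError("mode must be 'concept' or 'skill'")
    targets = benchmark_features[mode]

    def score(item):
        # Assumption: an instruction is scored by its closest benchmark sample.
        return max(cosine(item[mode], t) for t in targets)

    ranked = sorted(pool, key=score, reverse=True)
    return [item["id"] for item in ranked[:k]]


# Toy data, for illustration only.
pool = [
    {"id": "a", "concept": [1.0, 0.0], "skill": [1.0, 0.0]},
    {"id": "b", "concept": [0.0, 1.0], "skill": [0.0, 1.0]},
    {"id": "c", "concept": [0.2, 0.8], "skill": [0.9, 0.1]},
]
benchmark = {"concept": [[0.0, 1.0]], "skill": [[1.0, 0.0]]}

skill_pick = select_instructions(pool, benchmark, mode="skill", k=2)
concept_pick = select_instructions(pool, benchmark, mode="concept", k=2)
```

With the toy data, the two modes pick different subsets of the same pool, which is the practical consequence of the concept/skill dichotomy: the same benchmark would be served by different training data depending on which kind of similarity it benefits from.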
