Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models

Andrew Bai; Justin Cui; Ruochen Wang; Cho-Jui Hsieh

arXiv:2508.10339 · cs.CV · August 15, 2025

TL;DR

This paper proposes a targeted instruction-data selection method for vision-language models. It determines whether a given benchmark benefits mainly from training instructions with similar visual concepts or with similar skills, then selects the instructions that match best, improving benchmark performance over existing selection baselines.

Contribution

The paper introduces a simple method that selects training instructions by matching a target benchmark's concepts or skills, whichever the benchmark predominantly benefits from, and validates it on 10+ benchmarks.

Findings

01

+0.9% over the best existing baseline, averaged over all benchmarks

02

+1.5% over the best existing baseline on the skill-focused subset of benchmarks

03

Highlights the importance of balancing concepts and skills in training data

Abstract

Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we found that vision-language benchmarks fall into the dichotomy of mainly benefiting from training on instructions with similar skills or visual concepts. Inspired by the discovery, we designed a simple targeted training data selection method to optimize the performance of a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select instructions with the most matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of…
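
The three steps in the abstract (extract, decide, select) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how concepts and skills are extracted or how the concept-vs-skill decision is made. The sketch assumes each example already carries a hypothetical "concept" vector and "skill" vector, takes the decision as a `mode` argument, and uses cosine similarity to the benchmark's mean vector as a stand-in matching score. All names (`select_instructions`, `pool`, `benchmark`) are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_instructions(pool, benchmark, mode, k):
    """Return indices of the k pool instructions that best match the benchmark.

    pool, benchmark: lists of dicts holding hypothetical "concept" and "skill"
    feature vectors (how these are extracted is an assumption, not from the paper).
    mode: "concept" or "skill", i.e. which feature the benchmark is assumed
    to benefit from; the decision rule itself is not modelled here.
    """
    if mode not in ("concept", "skill"):
        raise ValueError("mode must be 'concept' or 'skill'")
    # Summarise the benchmark by the mean of the chosen feature vectors.
    target = mean_vector([example[mode] for example in benchmark])
    # Rank pool instructions by similarity to the benchmark summary, best first.
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(pool[i][mode], target),
        reverse=True,
    )
    return ranked[:k]


# Toy data: three candidate instructions and a two-example benchmark.
pool = [
    {"id": "a", "concept": [1.0, 0.0], "skill": [0.0, 1.0]},
    {"id": "b", "concept": [0.0, 1.0], "skill": [1.0, 0.0]},
    {"id": "c", "concept": [0.9, 0.1], "skill": [0.7, 0.7]},
]
benchmark = [
    {"concept": [1.0, 0.0], "skill": [1.0, 0.0]},
    {"concept": [0.8, 0.2], "skill": [1.0, 0.2]},
]

concept_pick = select_instructions(pool, benchmark, mode="concept", k=2)
skill_pick = select_instructions(pool, benchmark, mode="skill", k=2)
```

On this toy data the two modes choose different subsets of the same pool, which is the point of the concept/skill distinction: the same instruction can be a good match on one axis and a poor match on the other.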
