Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
Shivam Chandhok, Qian Yang, Oscar Manas, Kanishk Jain, Leonid Sigal, Aishwarya Agrawal

TL;DR
PROGRESS is a novel, efficient framework for vision-language model instruction tuning that dynamically selects informative samples based on learning progress, reducing data and supervision needs while improving performance.
Contribution
It introduces a sample selection method based on relative error-driven learning progress that requires no upfront annotations or auxiliary supervision.
Findings
Outperforms state-of-the-art baselines with less data and supervision.
Demonstrates strong generalization across architectures and transferability to larger models.
Requires no additional heavy gradient computations for data selection.
Abstract
Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive, requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples: those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior…
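The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the relative-progress formula (relative change in per-skill accuracy between two checkpoints), the function names, and the toy numbers are all assumptions made for illustration. The temperature-controlled softmax over skills follows the description given in the reviews.

```python
import math
import random

def relative_progress(prev_acc, curr_acc, eps=1e-8):
    # ASSUMPTION: progress is the relative change in per-skill accuracy
    # between two consecutive checkpoints; the paper may define it differently.
    return [(c - p) / (p + eps) for p, c in zip(prev_acc, curr_acc)]

def softmax(scores, tau=1.0):
    # Temperature-controlled softmax: a small tau concentrates sampling on the
    # fastest-improving skills, a large tau spreads it more evenly.
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_samples(pools, probs, budget, rng):
    # Draw up to `budget` sample ids: pick a skill according to `probs`,
    # then take a random not-yet-selected sample from that skill's pool.
    pools = [list(p) for p in pools]  # copy so the caller's pools are untouched
    chosen = []
    while len(chosen) < budget:
        live = [i for i, p in enumerate(pools) if p]
        if not live:
            break  # every pool is exhausted
        # A tiny floor keeps the weights valid if some probabilities underflow to 0.
        weights = [probs[i] + 1e-12 for i in live]
        k = rng.choices(live, weights=weights, k=1)[0]
        chosen.append(pools[k].pop(rng.randrange(len(pools[k]))))
    return chosen

# Toy example with three hypothetical skills (clusters).
prev_acc = [0.50, 0.40, 0.20]
curr_acc = [0.55, 0.60, 0.20]
progress = relative_progress(prev_acc, curr_acc)
probs = softmax(progress, tau=0.5)
pools = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
picked = select_samples(pools, probs, budget=4, rng=random.Random(0))
```

In this toy setup the second skill improves fastest, so it receives the largest sampling probability, while the stalled third skill is sampled least often but is not excluded outright.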
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
### Significance and Practicality
The paper addresses the critical bottleneck of high computational and annotation costs in VLM training. This is a highly relevant and valuable research direction that is crucial for advancing research and accessibility within the community.
### Pragmatic Framework Design
The method avoids reliance on additional large-scale auxiliary models or a fully pre-annotated dataset. Its "on-demand" annotation strategy makes it more feasible and scalable for real-world
### The Validity of the "Skill" Definition Requires Further Substantiation
The framework's entire premise rests on the assumption that unsupervised clustering effectively partitions data into meaningful, distinct "skills." However, the case studies in Figure 3 (e.g., the "Grounding" skill) show that samples within a cluster share nearly identical question phrasing. This raises a significant concern that the clustering may be predominantly driven by textual patterns from BERT rather than capturin
- Achieves near–full-data performance with a small labeled subset; strong results hold across multiple model sizes and benchmarks.
- Shows that progress-driven sampling at a cluster/“skill” level with a simple temperature-controlled softmax yields a stable, effective curriculum.
- Extensive evaluation across diverse benchmarks, model families, budget scales, and wall-clock comparisons.
- Thorough pipeline, tables, and ablation studies; appendices provide hyper-parameters and implementation d
**Characterization of skills**: The approach implicitly treats DINO+BERT clusters as skills, but the manuscript provides limited evidence that these clusters correspond to meaningful or stable competencies. A more concrete analysis such as semantic labeling of clusters, stability across feature backbones and seeds, and sensitivity to the number of clusters, would help readers understand what is being learned and whether the method targets distinct capabilities rather than surface correlations. (
This paper proposes PROGRESS, whose core advantage lies in its extremely high data, annotation, and computational efficiency. It adopts a dynamic strategy inspired by curriculum learning, enabling the model to proactively select samples during training based on its own evolving needs. The method tracks the learning progress of different skills and prioritizes samples where the model achieves the "fastest progress," thereby effectively controlling the order in which skills are acquired. This str
1. This method requires a "warmup" phase (using 9% of the data in the paper), which is claimed to enable the model to obtain a "reliable" initial performance evaluation. The problem is that it can be seen from Appendix Figure 10(a) that the warmup ratio is extremely critical. A 9% ratio yields the best results, while ratios of 3% or 12% both lead to a significant decline in performance. This means the "initial state" of the model greatly affects the subsequent calculation of "relative progress".
1. The hypothesis of using the model's own feedback to select informative samples is interesting.
2. The whole idea is quite easy to understand and implement.
3. The experimental results are promising, showing that the method can reduce the amount of data needed.
1. The motivation is not clear. It is unclear about the skill acquisition and how to prioritize the concept learning. What are the connections between skill and concept? This would be better clarified with concrete examples.
2. This paper seems shallow in VLMs, lacking technical novelty and insights concerning the vision part. The proposed method seems to be a method for all models based on LLMs.
3. The authors claim that tuning the temperature $\tau$ in softmax can balance informativeness and
1. The paper is well-motivated, facilitating MLLM finetuning with multi-round sample selection and curriculum learning.
2. The proposed feature extraction solution, which uses unsupervised specialist models for clustering, presents a compelling and potentially more effective alternative to COINCIDE.
1. The method's novelty appears incremental as it is heavily built upon the COINCIDE framework. A significant concern is its reliance on an initial 9% data query from the prior method, based on specialist features, for its warm-up phase. This dependency undermines the convincingness of the proposed method's standalone effectiveness, particularly when the main results are reported using only a 20% data subset.
2. The framework has significant limitations for practical application. While multi-ro
Taxonomy
Topics: Machine Learning and Data Classification
