# Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

**Authors:** Manish Nagaraj, Deepak Ravikumar, Kaushik Roy

arXiv: 2508.20230 · 2025-11-20

## TL;DR

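This paper introduces Correlation of Loss Differences (CLD), a scalable coreset selection metric that scores training samples by how closely their loss trajectories align with those of a held-out validation set. CLD needs only per-sample loss values at training checkpoints, and coresets selected with one architecture transfer to others with less than 1% degradation.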

## Contribution

We propose CLD, a novel coreset selection metric based on the correlation of loss differences, with convergence guarantees and empirical performance that typically matches or exceeds state-of-the-art methods.

## Key findings

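- On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art subset selection methods across subset sizes, staying within 1% of more computationally expensive baselines even when not leading.
- CLD requires only per-sample loss values at training checkpoints, avoiding gradient and curvature computations, and transfers across architectures (ResNet, VGG, DenseNet) with <1% degradation.
- CLD remains stable when computed from early checkpoints only, with negligible accuracy loss, and reduces bias through per-class validation alignment without additional stratified sampling.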

## Abstract

Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
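
The abstract describes CLD as measuring how well each training sample's loss trajectory aligns with that of a held-out validation set, using only per-sample losses at checkpoints. The following is a minimal sketch of one plausible reading of that description, not the authors' implementation: it assumes the score is the Pearson correlation between a training sample's checkpoint-to-checkpoint loss differences and the mean loss differences of the validation samples in the same class. The function names (`cld_scores`, `select_coreset`), the array layout, and the top-k selection rule are all hypothetical; the paper's exact definition may differ.

```python
import numpy as np


def _pearson_columns(x, ref, eps=1e-12):
    """Pearson correlation of each column of x (shape (D, N)) with ref (shape (D,))."""
    xc = x - x.mean(axis=0, keepdims=True)
    rc = ref - ref.mean()
    num = xc.T @ rc
    # eps guards against division by zero for constant trajectories (assumption).
    den = np.linalg.norm(xc, axis=0) * np.linalg.norm(rc) + eps
    return num / den


def cld_scores(train_losses, val_losses, train_labels, val_labels):
    """Hypothetical CLD-style score per training sample.

    train_losses: (T, N) per-sample training losses at T checkpoints.
    val_losses:   (T, M) per-sample validation losses at the same checkpoints.
    Assumes every training class also appears in the validation set.
    """
    d_train = np.diff(train_losses, axis=0)  # (T-1, N) loss differences
    d_val = np.diff(val_losses, axis=0)      # (T-1, M) loss differences
    scores = np.empty(train_losses.shape[1])
    for c in np.unique(train_labels):
        idx = np.where(train_labels == c)[0]
        # Per-class alignment: compare only against same-class validation samples.
        ref = d_val[:, val_labels == c].mean(axis=1)
        scores[idx] = _pearson_columns(d_train[:, idx], ref)
    return scores


def select_coreset(scores, k):
    """Indices of the k highest-scoring samples (assumed selection rule)."""
    return np.argsort(scores)[::-1][:k]
```

In this sketch the only inputs are loss values logged at checkpoints, so no gradients or Hessians are needed, which mirrors the efficiency argument in the abstract. Restricting `train_losses` and `val_losses` to the first few rows would correspond to the early-checkpoint setting.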

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20230/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20230/full.md

---
Source: https://tomesphere.com/paper/2508.20230