Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

Manish Nagaraj; Deepak Ravikumar; Kaushik Roy

arXiv:2508.20230·cs.LG·November 20, 2025

TL;DR

This paper introduces CLD, a scalable and efficient method for selecting impactful training data based on loss trajectory correlation, improving model training efficiency and transferability across architectures.

Contribution

We propose CLD, a novel coreset selection metric based on loss difference correlation, with theoretical guarantees and superior empirical performance.

Findings

1. CLD typically outperforms or closely matches state-of-the-art subset selection methods on CIFAR-100 and ImageNet-1k.

2. CLD maintains high accuracy at lower computational cost and transfers effectively across architectures.

3. CLD is stable when computed from early checkpoints only, and it reduces bias through per-class validation alignment.

Abstract

Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k,…

Tables

Table 1: End-to-end compute and storage overhead for $\mathtt{CLD}$ and baselines. The storage column lists method-specific extras during selection only. Notation is defined in Section 7.

Method	Computational cost (selection $+$ coreset training)	Storage overhead (method-specific extras)
$\mathtt{Herding}$	$3 N T_{proxy} f + N f + \mathcal{O}(N d k) + 3 k T f_{large}$	$\mathcal{O}(N d)$ [features] (+ optional $\mathcal{O}(k^{2})$ Gram)
$\mathtt{Forgetting}$	$3 N T_{early} f_{large} + 3 k T_{late} f_{large}$	$\mathcal{O}(N)$ [per-sample counter]
$\mathtt{AUM}$	$3 N T_{proxy} f + 3 k T f_{large}$	$\mathcal{O}(N)$ [running sums]
$\mathtt{Cal}$	$3 k T_{proxy} f + U f + \mathcal{O}(U \kappa d) + 3 k T f_{large}$	$\mathcal{O}(U d)$ [proxy features]
$\mathtt{GraNd}$	$3 N T_{early} R f_{large} + 3 N T_{early} R f_{large} + 3 k T_{late} f_{large}$	$\mathcal{O}(N)$ [scores/logs]
$\mathtt{EL2N}$	$3 N T_{early} R f_{large} + 3 k T_{late} f_{large}$	$\mathcal{O}(N)$ [scores]
$\mathtt{Moderate}$	$3 N T_{proxy} f + N f + \mathcal{O}(N d + N \log N) + 3 k T f_{large}$	$\mathcal{O}(N d)$ [proxy features]
$\mathbb{D}^{2}$-$\mathtt{Pruning}$	$3 N T_{proxy} f + N f + \mathcal{O}(N \kappa d) + \mathcal{O}(H N \kappa) + 3 k T f_{large}$	$\mathcal{O}(N d) + \mathcal{O}(N \kappa)$ [features + $k$NN graph]
$\mathtt{CRAIG}$	$3 N T_{proxy} f + \mathcal{O}(A (N k + N \log(1 / \epsilon)) D_{eff}) + 3 k T f_{large}$	$\mathcal{O}(N (F + c))$ [per-anchor embeddings]
$\mathtt{Glister}$	$3 k T f_{large} + \mathcal{O}((k Q + N \log(1 / \epsilon)) f_{large} T / \gamma)$	$\mathcal{O}(Q)$ [validation cache]
$\mathtt{GraphCut}$	$3 N T_{proxy} f + N f + \mathcal{O}(N^{2} k) + 3 k T f_{large}$	$\mathcal{O}(N^{2})$ [pairwise similarities]
$\mathtt{SloCurv}$	$3 N T_{proxy} f + 3 N (R + 1) f + 3 k T f_{large}$	$\mathcal{O}(N) + \mathcal{O}(R d)$ [running stats + probe dirs]
$\mathtt{TDDS}$	$3 N T_{proxy} f + 3 N T_{proxy} f + 3 k T f_{large}$	$\mathcal{O}(N J)$ [windowed logs]
$\mathtt{Dyn}$-$\mathtt{Unc}$	$3 N T_{proxy} f + 3 k T f_{large}$	$\mathcal{O}(N J)$ [windowed logs]
$\mathtt{DUAL}$	$3 N T_{proxy,early} f + 3 k T f_{large}$	$\mathcal{O}(N J)$ [windowed logs]
$\mathbf{CLD}$ (Ours)	$3 N T_{proxy} f + Q T_{proxy} f + 3 k T f_{large}$	$\mathcal{O}((N + Q) T_{proxy})$ [loss logs]

Table 2: Performance (top-1 accuracy) of score-based coreset methods on the CIFAR-100 train split. The coresets were selected and fine-tuned on ResNet-18. The full training set performance was $70.95 \pm 0.68$. The mean accuracy over 5 runs, along with the standard deviation, is reported.

Coreset size	$\mathtt{Random}$	$\mathtt{Herding}$	$\mathtt{Forgetting}$	$\mathtt{Cal}$	$\mathtt{EL2N}$	$\mathtt{Moderate}$	$\mathtt{CCS}$ ($\mathtt{AUM}$)	$\mathbb{D}^{2}$-$\mathtt{Pruning}$	$\mathbf{CLD}$ (Ours)
0.2%	3.66 $\pm$ 0.41	2.57 $\pm$ 0.52	3.52 $\pm$ 0.16	5.24 $\pm$ 0.41	3.9 $\pm$ 0.45	3.8 $\pm$ 0.47	4.1 $\pm$ 0.5	4.6 $\pm$ 0.52	5.67 $\pm$ 0.36
0.4%	6.03 $\pm$ 0.28	3.42 $\pm$ 0.49	5.12 $\pm$ 0.53	7.46 $\pm$ 0.28	6.2 $\pm$ 0.41	6 $\pm$ 0.44	6.9 $\pm$ 0.42	7.4 $\pm$ 0.46	7.51 $\pm$ 0.59
0.6%	6.8 $\pm$ 0.27	4.07 $\pm$ 0.41	6.8 $\pm$ 0.18	9.12 $\pm$ 0.27	8.7 $\pm$ 0.4	8.4 $\pm$ 0.42	8.2 $\pm$ 0.38	8.6 $\pm$ 0.43	8.91 $\pm$ 0.41
0.8%	7.96 $\pm$ 0.79	5.14 $\pm$ 0.7	8.42 $\pm$ 0.38	10.19 $\pm$ 0.79	9.8 $\pm$ 0.48	10.2 $\pm$ 0.49	10.5 $\pm$ 0.55	10.5 $\pm$ 0.5	10.56 $\pm$ 0.24
1%	9.38 $\pm$ 0.41	5.32 $\pm$ 0.33	11.53 $\pm$ 0.44	13.73 $\pm$ 0.41	13.2 $\pm$ 0.46	12.7 $\pm$ 0.45	12.1 $\pm$ 0.49	13.1 $\pm$ 0.47	13.04 $\pm$ 0.54
2%	12.74 $\pm$ 0.28	8.29 $\pm$ 0.45	15.9 $\pm$ 0.21	16.45 $\pm$ 0.28	17 $\pm$ 0.43	16.3 $\pm$ 0.43	16.4 $\pm$ 0.46	17.5 $\pm$ 0.45	17.05 $\pm$ 0.22
3%	16.58 $\pm$ 0.95	9.23 $\pm$ 0.52	18.24 $\pm$ 0.32	20.05 $\pm$ 0.95	21.1 $\pm$ 0.5	20 $\pm$ 0.46	20.3 $\pm$ 0.52	21.3 $\pm$ 0.49	21.85 $\pm$ 0.85
4%	19.99 $\pm$ 1.04	11.52 $\pm$ 1.16	23.82 $\pm$ 0.86	22.97 $\pm$ 1.04	24.5 $\pm$ 0.47	23.2 $\pm$ 0.44	23.9 $\pm$ 0.5	24.1 $\pm$ 0.46	24.01 $\pm$ 0.24
5%	22.41 $\pm$ 0.54	13.66 $\pm$ 1.35	26.38 $\pm$ 0.85	24.37 $\pm$ 0.54	27.8 $\pm$ 0.44	26.1 $\pm$ 0.41	27.1 $\pm$ 0.44	28.4 $\pm$ 0.43	27.26 $\pm$ 0.83
6%	23.66 $\pm$ 0.49	15.49 $\pm$ 0.92	28.16 $\pm$ 0.62	26.93 $\pm$ 0.49	29.9 $\pm$ 0.4	28.4 $\pm$ 0.39	29.4 $\pm$ 0.4	30.8 $\pm$ 0.4	31.24 $\pm$ 0.19
7%	28.36 $\pm$ 0.85	18.52 $\pm$ 0.56	30.95 $\pm$ 0.66	27.37 $\pm$ 0.85	32 $\pm$ 0.38	30.3 $\pm$ 0.36	31.5 $\pm$ 0.38	31.7 $\pm$ 0.37	32.37 $\pm$ 0.71
8%	30.75 $\pm$ 1.12	18.52 $\pm$ 0.39	31.84 $\pm$ 0.23	28.32 $\pm$ 1.12	33.2 $\pm$ 0.36	31.6 $\pm$ 0.34	32.6 $\pm$ 0.36	32.9 $\pm$ 0.35	33.31 $\pm$ 0.68
9%	32.12 $\pm$ 1.48	18.52 $\pm$ 0.98	32.79 $\pm$ 0.94	29.19 $\pm$ 1.48	34.4 $\pm$ 0.34	33 $\pm$ 0.33	33.8 $\pm$ 0.35	34.1 $\pm$ 0.34	34.04 $\pm$ 0.74
10%	32.75 $\pm$ 1.02	19.54 $\pm$ 0.85	33.04 $\pm$ 0.65	31.02 $\pm$ 1.02	35.8 $\pm$ 0.32	34.2 $\pm$ 0.31	35 $\pm$ 0.33	35.4 $\pm$ 0.32	35.81 $\pm$ 0.21
20%	35.63 $\pm$ 0.99	35.14 $\pm$ 1.06	37.12 $\pm$ 0.85	35.24 $\pm$ 0.81	39.6 $\pm$ 0.28	38.7 $\pm$ 0.27	40.1 $\pm$ 0.29	41.2 $\pm$ 0.28	39.15 $\pm$ 0.57
50%	43.17 $\pm$ 1.02	44.14 $\pm$ 0.4	45.78 $\pm$ 0.41	42.18 $\pm$ 0.69	47.2 $\pm$ 0.24	46 $\pm$ 0.23	48.3 $\pm$ 0.25	49 $\pm$ 0.24	46.18 $\pm$ 0.13
75%	63.21 $\pm$ 0.5	66.12 $\pm$ 0.46	66.12 $\pm$ 0.61	63.05 $\pm$ 0.41	65.8 $\pm$ 0.2	65.1 $\pm$ 0.19	66.9 $\pm$ 0.21	67.8 $\pm$ 0.2	68.01 $\pm$ 0.82

Table 3: Performance (top-1 accuracy) of optimization-based and training property-based coreset methods on the CIFAR-100 train split. The coresets were selected and fine-tuned on ResNet-18. The full training set performance was $70.95 \pm 0.68$. The mean accuracy over 5 runs, along with the standard deviation, is reported.

Coreset size	$\mathtt{Random}$	$\mathtt{CRAIG}$	$\mathtt{Glister}$	$\mathtt{GraphCut}$	$\mathtt{SloCurv}$	$\mathtt{TDDS}$	$\mathtt{Dyn}$-$\mathtt{Unc}$	$\mathtt{DUAL}$	$\mathbf{CLD}$ (Ours)
0.2%	3.66 $\pm$ 0.41	3.5 $\pm$ 0.42	3.43 $\pm$ 0.32	5.8 $\pm$ 0.24	3.62 $\pm$ 0.44	3.7 $\pm$ 0.44	2.1 $\pm$ 0.55	2 $\pm$ 0.56	5.67 $\pm$ 0.36
0.4%	6.03 $\pm$ 0.28	5.8 $\pm$ 0.4	4.91 $\pm$ 0.37	7.07 $\pm$ 0.39	5.46 $\pm$ 0.36	6 $\pm$ 0.41	3.5 $\pm$ 0.5	3.4 $\pm$ 0.52	7.51 $\pm$ 0.59
0.6%	6.8 $\pm$ 0.27	7.9 $\pm$ 0.39	7.49 $\pm$ 0.61	8.96 $\pm$ 0.31	7.44 $\pm$ 0.2	8.2 $\pm$ 0.4	5.1 $\pm$ 0.48	5 $\pm$ 0.49	8.91 $\pm$ 0.41
0.8%	7.96 $\pm$ 0.79	9.8 $\pm$ 0.47	8.65 $\pm$ 0.47	9.39 $\pm$ 0.81	9.96 $\pm$ 0.42	10.1 $\pm$ 0.48	7 $\pm$ 0.53	7.2 $\pm$ 0.54	10.56 $\pm$ 0.24
1%	9.38 $\pm$ 0.41	12.2 $\pm$ 0.44	9.04 $\pm$ 0.43	11.86 $\pm$ 0.47	12.17 $\pm$ 0.3	12.8 $\pm$ 0.45	9.5 $\pm$ 0.5	10.4 $\pm$ 0.51	13.04 $\pm$ 0.54
2%	12.74 $\pm$ 0.28	15.9 $\pm$ 0.42	14.54 $\pm$ 0.41	16.95 $\pm$ 0.55	13.35 $\pm$ 0.42	16.5 $\pm$ 0.43	14.2 $\pm$ 0.47	15.8 $\pm$ 0.48	17.05 $\pm$ 0.22
3%	16.58 $\pm$ 0.95	19.6 $\pm$ 0.46	17.47 $\pm$ 0.32	19.21 $\pm$ 0.57	22.67 $\pm$ 0.8	20.3 $\pm$ 0.47	18.9 $\pm$ 0.5	20.5 $\pm$ 0.5	21.85 $\pm$ 0.85
4%	19.99 $\pm$ 1.04	23 $\pm$ 0.44	23.99 $\pm$ 0.58	21.33 $\pm$ 0.71	21.97 $\pm$ 0.55	23.7 $\pm$ 0.45	22.7 $\pm$ 0.48	24.3 $\pm$ 0.47	24.01 $\pm$ 0.24
5%	22.41 $\pm$ 0.54	25.8 $\pm$ 0.4	24.83 $\pm$ 0.73	26.31 $\pm$ 0.58	23.44 $\pm$ 0.71	26.6 $\pm$ 0.42	25.8 $\pm$ 0.44	27.5 $\pm$ 0.43	27.26 $\pm$ 0.83
6%	23.66 $\pm$ 0.49	28.1 $\pm$ 0.38	26.57 $\pm$ 0.69	30.35 $\pm$ 0.68	25.41 $\pm$ 0.43	28.9 $\pm$ 0.39	28.2 $\pm$ 0.41	30 $\pm$ 0.4	31.24 $\pm$ 0.19
7%	28.36 $\pm$ 0.85	30 $\pm$ 0.36	27.57 $\pm$ 0.84	31.63 $\pm$ 0.71	27.45 $\pm$ 0.6	30.8 $\pm$ 0.37	30.3 $\pm$ 0.38	32.1 $\pm$ 0.38	32.37 $\pm$ 0.71
8%	30.75 $\pm$ 1.12	31.3 $\pm$ 0.34	28.79 $\pm$ 0.8	32.22 $\pm$ 0.48	29.17 $\pm$ 0.47	32 $\pm$ 0.35	31.7 $\pm$ 0.36	33.4 $\pm$ 0.35	33.31 $\pm$ 0.68
9%	32.12 $\pm$ 1.48	32.7 $\pm$ 0.32	30.22 $\pm$ 0.32	33.02 $\pm$ 0.83	30.71 $\pm$ 0.57	33.4 $\pm$ 0.33	33 $\pm$ 0.34	34.8 $\pm$ 0.33	34.04 $\pm$ 0.74
10%	32.75 $\pm$ 1.02	34.1 $\pm$ 0.3	31.22 $\pm$ 1.33	34.41 $\pm$ 0.96	33.17 $\pm$ 1.16	34.7 $\pm$ 0.31	34.4 $\pm$ 0.32	36.2 $\pm$ 0.31	35.81 $\pm$ 0.21
20%	35.63 $\pm$ 0.99	38.2 $\pm$ 0.26	36.49 $\pm$ 0.47	38.02 $\pm$ 0.52	37.23 $\pm$ 0.81	39.1 $\pm$ 0.27	38.8 $\pm$ 0.28	40.5 $\pm$ 0.27	39.15 $\pm$ 0.57
50%	43.17 $\pm$ 1.02	45.6 $\pm$ 0.22	41.81 $\pm$ 0.88	45.23 $\pm$ 0.42	45.19 $\pm$ 0.42	46.5 $\pm$ 0.23	45.8 $\pm$ 0.24	47.5 $\pm$ 0.23	46.18 $\pm$ 0.13
75%	63.21 $\pm$ 0.5	64.8 $\pm$ 0.19	63.85 $\pm$ 1.02	65.18 $\pm$ 0.32	66.53 $\pm$ 0.61	65.9 $\pm$ 0.2	64.5 $\pm$ 0.2	66.2 $\pm$ 0.19	68.01 $\pm$ 0.82

Table 4: Performance (top-1 accuracy) of score-based coreset methods on the ImageNet-1k train split. The coresets were selected and fine-tuned on ResNet-18. The full training set performance was $69.91 \pm 0.01$. The mean accuracy over 5 runs, along with the standard deviation, is reported.

Coreset size	$\mathtt{Random}$	$\mathtt{Herding}$	$\mathtt{Forgetting}$	$\mathtt{Cal}$	$\mathtt{EL2N}$	$\mathtt{Moderate}$	$\mathtt{CCS}$ ($\mathtt{AUM}$)	$\mathbb{D}^{2}$-$\mathtt{Pruning}$	$\mathbf{CLD}$
0.1%	0.7 $\pm$ 0.03	0.31 $\pm$ 0.01	0.64 $\pm$ 0.01	1.13 $\pm$ 0.12	0.88 $\pm$ 0.25	0.85 $\pm$ 0.2	1.52 $\pm$ 0.5	1.95 $\pm$ 0.4	1.96 $\pm$ 0.7
0.5%	3.98 $\pm$ 0.19	1.39 $\pm$ 0.17	4.78 $\pm$ 1.01	6.84 $\pm$ 0.13	5.83 $\pm$ 0.1	4.75 $\pm$ 0.5	7.04 $\pm$ 0.1	7.2 $\pm$ 0.25	7.16 $\pm$ 0.63
1%	7.86 $\pm$ 0.43	4.32 $\pm$ 0.62	12.67 $\pm$ 0.51	13.17 $\pm$ 0.22	15.2 $\pm$ 0.5	14.32 $\pm$ 1.03	14.86 $\pm$ 0.25	16.01 $\pm$ 0.4	15.92 $\pm$ 0.41
5%	39.78 $\pm$ 0.23	15.36 $\pm$ 0.18	44.86 $\pm$ 0.74	37.65 $\pm$ 1.3	40.43 $\pm$ 0.03	38.95 $\pm$ 0.25	44.04 $\pm$ 0.1	45.75 $\pm$ 0.5	46.5 $\pm$ 0.19
10%	51.24 $\pm$ 0.04	26.84 $\pm$ 0.05	53.19 $\pm$ 0.06	44.16 $\pm$ 0.78	45.16 $\pm$ 0.4	44.56 $\pm$ 0.1	52.01 $\pm$ 0.2	50.65 $\pm$ 0.3	53.81 $\pm$ 0.23
30%	60.87 $\pm$ 0.13	46.61 $\pm$ 0.87	60.9 $\pm$ 0.05	54.41 $\pm$ 0.45	53.22 $\pm$ 0.25	55.29 $\pm$ 0.1	61.84 $\pm$ 0.5	60.75 $\pm$ 0.1	62.91 $\pm$ 0.51
40%	62.13 $\pm$ 0.38	53.88 $\pm$ 0.23	62.39 $\pm$ 0.91	58.45 $\pm$ 0.92	56.45 $\pm$ 0.1	60.08 $\pm$ 0.15	62.48 $\pm$ 0.02	61.04 $\pm$ 0.5	63.51 $\pm$ 0.75
50%	64.11 $\pm$ 0.12	59.14 $\pm$ 0.41	63.18 $\pm$ 0.05	60.11 $\pm$ 0.6	59.46 $\pm$ 0.45	62.58 $\pm$ 0.25	64.31 $\pm$ 0.04	64.92 $\pm$ 0.65	65.78 $\pm$ 1.03
65%	65.21 $\pm$ 0.03	64.23 $\pm$ 0.2	65.24 $\pm$ 0.02	64.42 $\pm$ 0.01	61.28 $\pm$ 0.25	64.98 $\pm$ 0.1	65.17 $\pm$ 0.1	67.01 $\pm$ 0.1	68.02 $\pm$ 0.38
70%	68.81 $\pm$ 0.04	65.22 $\pm$ 0.02	67.85 $\pm$ 0.9	66.57 $\pm$ 0.03	64.23 $\pm$ 0.89	65.18 $\pm$ 0.03	68.81 $\pm$ 0.1	68.91 $\pm$ 0.01	68.91 $\pm$ 0.01
75%	68.41 $\pm$ 0.02	67.42 $\pm$ 0.01	68.01 $\pm$ 0.1	67.12 $\pm$ 0.02	65.45 $\pm$ 0.2	67.13 $\pm$ 0.03	69.01 $\pm$ 0.03	69.42 $\pm$ 0.05	69.42 $\pm$ 0.05
80%	68.12 $\pm$ 0.03	68.02 $\pm$ 0.02	68.81 $\pm$ 0.5	68.15 $\pm$ 0.03	66.95 $\pm$ 0.25	68.9 $\pm$ 0.01	69.93 $\pm$ 0.02	69.93 $\pm$ 0.02	69.93 $\pm$ 0.02
85%	68.75 $\pm$ 0.05	68.01 $\pm$ 0.03	68.81 $\pm$ 0.5	68.91 $\pm$ 0.04	67.17 $\pm$ 0.1	68.9 $\pm$ 0.2	69.91 $\pm$ 0.25	69.93 $\pm$ 0.02	69.93 $\pm$ 0.02
90%	69.1 $\pm$ 0.78	69.92 $\pm$ 0.4	70.04 $\pm$ 0.52	69.23 $\pm$ 0.5	68.81 $\pm$ 0.3	69.01 $\pm$ 0.2	70.12 $\pm$ 0.03	70.12 $\pm$ 0.03	70.12 $\pm$ 0.03
95%	69.91 $\pm$ 0.06	69.91 $\pm$ 0.04	69.91 $\pm$ 0.04	70.12 $\pm$ 0.01	69.91 $\pm$ 0.04	70.12 $\pm$ 0.03	69.91 $\pm$ 0.1	70.12 $\pm$ 0.03	70.12 $\pm$ 0.03

Table 5: Performance (top-1 accuracy) of optimization-based and training property-based coreset methods on the ImageNet-1k train split. The coresets were selected and fine-tuned on ResNet-18. The full training set performance was $69.91 \pm 0.01$. The mean accuracy over 5 runs, along with the standard deviation, is reported.

Coreset size	$\mathtt{Random}$	$\mathtt{CRAIG}$	$\mathtt{Glister}$	$\mathtt{GraphCut}$	$\mathtt{SloCurv}$	$\mathtt{TDDS}$	$\mathtt{Dyn}$-$\mathtt{Unc}$	$\mathtt{DUAL}$	$\mathbf{CLD}$
0.1%	0.7 $\pm$ 0.03	0.94 $\pm$ 0.1	0.86 $\pm$ 0.01	1.09 $\pm$ 0.09	1.23 $\pm$ 0.06	1.04 $\pm$ 0.2	0.45 $\pm$ 0.9	0.41 $\pm$ 1.06	1.96 $\pm$ 0.7
0.5%	3.98 $\pm$ 0.19	6.41 $\pm$ 0.25	5.55 $\pm$ 0.05	7.27 $\pm$ 0.03	5.89 $\pm$ 0.07	5.64 $\pm$ 0.03	1.45 $\pm$ 1.05	1.34 $\pm$ 0.87	7.16 $\pm$ 0.63
1%	7.86 $\pm$ 0.43	15.56 $\pm$ 0.05	12.45 $\pm$ 0.01	14.27 $\pm$ 0.31	14.17 $\pm$ 0.02	14.05 $\pm$ 0.1	3.92 $\pm$ 0.8	4.85 $\pm$ 0.75	15.92 $\pm$ 0.41
5%	39.78 $\pm$ 0.23	39.95 $\pm$ 0.01	42.19 $\pm$ 0.03	39.8 $\pm$ 0.6	40.1 $\pm$ 0.14	40.54 $\pm$ 0.05	15.98 $\pm$ 0.5	16.2 $\pm$ 0.4	46.5 $\pm$ 0.19
10%	51.24 $\pm$ 0.04	46.76 $\pm$ 0.1	50.1 $\pm$ 0.01	48.27 $\pm$ 1.02	46.39 $\pm$ 0.5	46.45 $\pm$ 0.25	29.78 $\pm$ 0.6	50.75 $\pm$ 0.5	53.81 $\pm$ 0.23
30%	60.87 $\pm$ 0.13	55.41 $\pm$ 0.05	58.53 $\pm$ 0.05	61.23 $\pm$ 0.01	57.19 $\pm$ 0.01	57.24 $\pm$ 0.1	50.16 $\pm$ 0.5	60.19 $\pm$ 0.03	62.91 $\pm$ 0.51
40%	62.13 $\pm$ 0.38	56.55 $\pm$ 0.02	61.72 $\pm$ 0.02	63.23 $\pm$ 0.08	62.11 $\pm$ 0.67	62.23 $\pm$ 0.5	60.25 $\pm$ 0.2	63.45 $\pm$ 0.25	63.51 $\pm$ 0.75
50%	64.11 $\pm$ 0.12	58.56 $\pm$ 0.5	63.41 $\pm$ 0.51	65.17 $\pm$ 0.05	64.78 $\pm$ 0.13	63.96 $\pm$ 0.81	63.17 $\pm$ 0.1	65.21 $\pm$ 0.04	65.78 $\pm$ 1.03
65%	65.21 $\pm$ 0.03	63.41 $\pm$ 0.2	65.83 $\pm$ 0.02	67.91 $\pm$ 0.54	65.81 $\pm$ 0.02	65.19 $\pm$ 0.1	65.98 $\pm$ 0.25	68.31 $\pm$ 0.01	68.02 $\pm$ 0.38
70%	68.81 $\pm$ 0.04	65.21 $\pm$ 0.01	67.91 $\pm$ 0.03	68.52 $\pm$ 0.04	67.18 $\pm$ 0.03	67.25 $\pm$ 0.01	66.32 $\pm$ 0.1	68.76 $\pm$ 0.5	68.91 $\pm$ 0.01
75%	68.41 $\pm$ 0.02	68.01 $\pm$ 0.5	68.54 $\pm$ 0.46	68.81 $\pm$ 0.02	68.03 $\pm$ 0.15	68.14 $\pm$ 0.25	68.19 $\pm$ 0.01	69.92 $\pm$ 0.01	69.42 $\pm$ 0.05
80%	68.12 $\pm$ 0.03	68.01 $\pm$ 0.5	68.78 $\pm$ 0.02	69.12 $\pm$ 0.05	69.1 $\pm$ 0.14	68.91 $\pm$ 0.2	68.19 $\pm$ 0.01	69.92 $\pm$ 0.01	69.93 $\pm$ 0.02
85%	68.75 $\pm$ 0.05	68.78 $\pm$ 0.02	68.78 $\pm$ 0.02	70.01 $\pm$ 0.28	69.1 $\pm$ 0.02	68.91 $\pm$ 0.25	69.93 $\pm$ 0.2	69.93 $\pm$ 0.2	69.93 $\pm$ 0.02
90%	69.1 $\pm$ 0.78	68.48 $\pm$ 0.25	69.18 $\pm$ 0.34	70.12 $\pm$ 0.43	69.68 $\pm$ 0.03	69.68 $\pm$ 0.03	70.12 $\pm$ 0.03	70.12 $\pm$ 0.03	70.12 $\pm$ 0.03
95%	69.91 $\pm$ 0.06	69.91 $\pm$ 0.04	69.91 $\pm$ 0.04	70.12 $\pm$ 0.43	69.91 $\pm$ 0.04	69.68 $\pm$ 0.03	70.12 $\pm$ 0.03	70.12 $\pm$ 0.03	70.12 $\pm$ 0.03

Table 6: Cross-architecture performance of ImageNet-1k coresets of different sizes identified by $\mathtt{CLD}$. Each cell reports the mean test accuracy $\pm$ standard deviation over 5 runs. Column headers give the coreset size as a percentage of the dataset. A minimal accuracy drop ($<1\%$) is observed when transferring coresets from a smaller ResNet-18 model to larger or different architectures.

Target Model	Source Model	5%	10%	25%	40%	50%	75%	80%	100%
ResNet-34	ResNet-18	46.91 $\pm$ 0.03	54.83 $\pm$ 0.05	60.21 $\pm$ 0.12	66.83 $\pm$ 0.48	68.93 $\pm$ 0.01	71.12 $\pm$ 0.01	73.04 $\pm$ 0.05	73.21 $\pm$ 0.01
ResNet-34	ResNet-34	47.01 $\pm$ 0.02	54.75 $\pm$ 0.35	60.98 $\pm$ 0.71	67.26 $\pm$ 0.10	69.03 $\pm$ 0.04	71.06 $\pm$ 0.02	73.04 $\pm$ 0.01	73.21 $\pm$ 0.01
ResNet-50	ResNet-18	47.19 $\pm$ 0.01	56.14 $\pm$ 0.05	62.83 $\pm$ 0.13	68.14 $\pm$ 0.01	71.15 $\pm$ 0.03	73.04 $\pm$ 0.07	74.95 $\pm$ 0.02	75.81 $\pm$ 0.05
ResNet-50	ResNet-50	48.10 $\pm$ 0.05	57.03 $\pm$ 0.03	63.14 $\pm$ 0.02	68.91 $\pm$ 0.05	71.25 $\pm$ 0.04	73.57 $\pm$ 0.01	74.95 $\pm$ 0.02	75.81 $\pm$ 0.05
DenseNet-121	ResNet-18	45.18 $\pm$ 0.37	55.18 $\pm$ 0.61	61.19 $\pm$ 0.01	67.34 $\pm$ 0.04	71.02 $\pm$ 0.83	71.85 $\pm$ 0.04	73.85 $\pm$ 0.02	74.90 $\pm$ 0.37
DenseNet-121	DenseNet-121	46.07 $\pm$ 0.02	55.88 $\pm$ 0.03	62.01 $\pm$ 0.72	67.91 $\pm$ 0.61	71.36 $\pm$ 0.10	72.18 $\pm$ 0.07	74.04 $\pm$ 0.04	74.90 $\pm$ 0.37
VGG-19(bn)	ResNet-18	43.12 $\pm$ 0.04	53.18 $\pm$ 0.03	60.12 $\pm$ 0.02	66.17 $\pm$ 0.07	70.37 $\pm$ 0.01	72.01 $\pm$ 0.04	73.98 $\pm$ 0.37	74.70 $\pm$ 0.71
VGG-19(bn)	VGG-19(bn)	44.01 $\pm$ 0.50	54.12 $\pm$ 0.47	61.01 $\pm$ 0.31	67.09 $\pm$ 0.16	70.25 $\pm$ 0.25	72.58 $\pm$ 0.02	74.05 $\pm$ 0.04	74.70 $\pm$ 0.71
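
The "$<1\%$ drop" claim can be checked directly against the means in Table 6. The sketch below is only a sanity check on the reported numbers: the dictionary layout and the names `TABLE_6` and `transfer_drops` are illustrative choices, not part of the paper's code.

```python
# Mean top-1 accuracies copied from Table 6, ordered by coreset size
# (5%, 10%, 25%, 40%, 50%, 75%, 80%, 100%).
# "proxy": coreset selected with ResNet-18; "self": selected with the target model.
TABLE_6 = {
    "ResNet-34": {
        "proxy": [46.91, 54.83, 60.21, 66.83, 68.93, 71.12, 73.04, 73.21],
        "self": [47.01, 54.75, 60.98, 67.26, 69.03, 71.06, 73.04, 73.21],
    },
    "ResNet-50": {
        "proxy": [47.19, 56.14, 62.83, 68.14, 71.15, 73.04, 74.95, 75.81],
        "self": [48.10, 57.03, 63.14, 68.91, 71.25, 73.57, 74.95, 75.81],
    },
    "DenseNet-121": {
        "proxy": [45.18, 55.18, 61.19, 67.34, 71.02, 71.85, 73.85, 74.90],
        "self": [46.07, 55.88, 62.01, 67.91, 71.36, 72.18, 74.04, 74.90],
    },
    "VGG-19(bn)": {
        "proxy": [43.12, 53.18, 60.12, 66.17, 70.37, 72.01, 73.98, 74.70],
        "self": [44.01, 54.12, 61.01, 67.09, 70.25, 72.58, 74.05, 74.70],
    },
}


def transfer_drops(table):
    """Accuracy lost by selecting with ResNet-18 instead of the target model."""
    return {
        target: [own - proxy for own, proxy in zip(rows["self"], rows["proxy"])]
        for target, rows in table.items()
    }


drops = transfer_drops(TABLE_6)
worst_drop = max(max(per_size) for per_size in drops.values())
print(f"largest drop from proxy selection: {worst_drop:.2f} points")
```

The largest gap is on VGG-19(bn) at the 10% coreset size; negative entries mean the ResNet-18 coreset happened to score slightly higher than the target-selected one.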

Equations

$$\arg\min_{\theta} R_{D}(\theta) = \arg\min_{\theta} \mathbb{E}_{z \sim D}[\ell(\theta, z)].$$

$$\arg\min_{\theta} \hat{R}(\theta, S) = \arg\min_{\theta} \left( \frac{1}{N} \sum_{m=1}^{N} \ell(\theta, z_{m}) \right).$$

$$G_{V}(\theta_{S}^{t}) = \frac{1}{Q} \sum_{j=1}^{Q} \nabla_{\theta} \ell(\theta_{S}^{t}, q_{j})$$

$$\Delta(z) = \left( \ell(\theta_{S}^{1}, z) - \ell(\theta_{S}^{0}, z), \dots, \ell(\theta_{S}^{T}, z) - \ell(\theta_{S}^{T-1}, z) \right) \in \mathbb{R}^{T}.$$

$$\Delta_{V}' = \left( \frac{1}{Q} \sum_{j=1}^{Q} [\ell(\theta_{S}^{1}, q_{j}) - \ell(\theta_{S}^{0}, q_{j})], \dots, \frac{1}{Q} \sum_{j=1}^{Q} [\ell(\theta_{S}^{T}, q_{j}) - \ell(\theta_{S}^{T-1}, q_{j})] \right) \in \mathbb{R}^{T}.$$

$$\mathtt{CLD}(z_{m}) := \rho(\Delta_{m}, \Delta_{V}'),$$
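
As a rough illustration of the definition above, the following sketch computes a CLD-style score from logged per-sample losses. It assumes that $\rho$ is the Pearson correlation and that the losses at checkpoints $0, \dots, T$ are already available as plain lists; the function names and the toy numbers are hypothetical and are not taken from the authors' code.

```python
import math


def loss_differences(losses):
    """Turn a loss trajectory [l_0, ..., l_T] into its T consecutive differences."""
    return [later - earlier for earlier, later in zip(losses, losses[1:])]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / norm if norm > 0 else 0.0


def cld_scores(train_losses, val_losses):
    """One score per training sample.

    train_losses: N trajectories, each of length T + 1 (one loss per checkpoint).
    val_losses:   Q trajectories of the same length for the held-out set.
    """
    num_checkpoints = len(val_losses[0])
    # Differencing the mean validation loss gives the same vector as averaging
    # the per-sample validation differences.
    mean_val = [
        sum(traj[t] for traj in val_losses) / len(val_losses)
        for t in range(num_checkpoints)
    ]
    delta_val = loss_differences(mean_val)
    return [pearson(loss_differences(traj), delta_val) for traj in train_losses]


# Toy data (made up): validation loss drops quickly at first, then flattens.
val = [[2.0, 1.5, 1.2, 1.1], [2.2, 1.6, 1.1, 1.0]]
train = [
    [2.1, 1.5, 1.1, 1.0],  # same shape as the validation trend
    [2.0, 1.9, 1.5, 0.9],  # improves late, while validation improves early
]
scores = cld_scores(train, val)
print(scores)
```

Because the correlation is mean-centered, the score reflects whether a sample's loss changes at the same checkpoints as the validation loss, not simply whether its loss goes down.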

$$\Delta_{V, c}' :=$$

where $V_{c}$

$$\mathtt{CLD}(z_{m}) := \rho(\Delta(z_{m}), \Delta_{V, c}') \quad \forall z_{m} : y_{m} = c.$$

$$C = \bigcup_{c=1}^{C} C_{c}, \quad C_{c} = \operatorname{Top-}k_{c}\left( \{ z_{m} \in S : y_{m} = c \}, \mathtt{CLD} \right),$$
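
The class-wise selection rule above can be sketched as follows. The scores are taken as given (one per training sample, computed against that sample's class-wise validation trajectory). The budget rule, which keeps the same fraction of every class with at least one sample, is an assumption made for this example, since the definition of $k_{c}$ is not reproduced here; `select_coreset` is a hypothetical name.

```python
from collections import defaultdict


def select_coreset(scores, labels, fraction):
    """Return sorted indices of the top-scoring samples within each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    coreset = []
    for indices in by_class.values():
        # Assumed budget rule: the same fraction of every class, at least one sample.
        k_c = max(1, round(fraction * len(indices)))
        ranked = sorted(indices, key=lambda i: scores[i], reverse=True)
        coreset.extend(ranked[:k_c])
    return sorted(coreset)


# Toy example: two classes with four samples each, keep half of each class.
scores = [0.9, 0.1, 0.8, -0.2, 0.3, 0.95, -0.5, 0.6]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(select_coreset(scores, labels, fraction=0.5))
```

Ranking within each class, rather than globally, keeps every class represented even when one class has uniformly lower scores.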

$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \| y - x \|_{2}^{2}, \quad \forall x, y \in \mathbb{R}^{p}.$$

$$\left\| \frac{1}{|C|} \sum_{m \in C} \nabla_{\theta} \ell(\theta, z_{m}) \right\|_{2} \leq B, \quad \| G_{V}(\theta) \|_{2} \leq B,$$

$$\| G_{V}(\theta) - \nabla_{\theta} R_{D}(\theta) \|_{2} \leq \delta, \quad \text{where } \delta = O(B / Q).$$

$$C = \{ z_{m} \in S : \mathtt{CLD}(z_{m}) \geq 1 - \epsilon \}, \quad \epsilon > 0, \ \epsilon \to 0,$$

$$\min_{0 \leq t < T} \| \nabla_{\theta} R_{D}(\theta_{C}^{t}) \|_{2}^{2} \leq \frac{2 [R_{D}(\theta_{C}^{0}) - R_{\inf}]}{\eta T} + L \eta B^{2} + \left( B \sqrt{2 \kappa} + \delta \right)^{2},$$

$$\min_{0 \leq t < T} \| \nabla_{\theta} R_{D}(\theta_{S}^{t}) \|_{2}^{2} \leq \frac{2 [R_{D}(\theta_{S}^{0}) - R_{\inf}]}{\eta T} + L \eta B^{2}.$$

$$\mathrm{Compute}_{\mathtt{CLD}} = 3 N T_{proxy} f + Q T_{proxy} f + 3 k T f_{large},$$
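
To get a feel for the relative size of the three terms, the cost expression can be evaluated for a made-up configuration. The reading of the symbols is an assumption here, because the notation section is not reproduced above: $f$ and $f_{large}$ are taken as the per-sample forward cost of the proxy and target models, $T_{proxy}$ and $T$ as epoch counts, and the factor 3 as the usual forward-plus-backward estimate. All numbers below are hypothetical.

```python
def cld_compute(n, q, k, t_proxy, t_full, f_proxy, f_large):
    """Total cost from the formula above, in the units of f_proxy and f_large."""
    proxy_training = 3 * n * t_proxy * f_proxy   # train the proxy on all N samples
    validation_passes = q * t_proxy * f_proxy    # forward-only passes on Q samples
    coreset_training = 3 * k * t_full * f_large  # train the target on k samples
    return proxy_training + validation_passes + coreset_training


# Hypothetical setting: N = 50,000, Q = 5,000, a 10% coreset (k = 5,000),
# 10 proxy epochs, 100 target epochs, and unit per-sample forward cost.
total = cld_compute(50_000, 5_000, 5_000, 10, 100, 1, 1)
full_data = 3 * 50_000 * 100 * 1  # training the target on the full dataset
validation_share = (5_000 * 10 * 1) / total

print(total, full_data, round(validation_share, 4))
```

In this toy setting the validation passes, which are the only cost specific to the metric, account for under 2% of the total.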

$$\cos\left( \angle\left( \nabla_{\theta} \ell(\theta_{S}^{t}, z_{m}), G_{V}(\theta_{S}^{t}) \right) \right) \geq 1 - \epsilon_{t}',$$

$$(\Delta(z_{m}))_{t} = \ell(\theta_{S}^{t}, z_{m}) - \ell(\theta_{S}^{t-1}, z_{m}) \approx \langle \nabla_{\theta} \ell(\theta_{S}^{t-1}, z_{m}), \delta\theta^{t-1} \rangle = x_{t},$$

$$(\Delta_{V}')_{t} = \frac{1}{Q} \sum_{j=1}^{Q} \left( \ell(\theta_{S}^{t}, q_{j}) - \ell(\theta_{S}^{t-1}, q_{j}) \right) \approx \langle G_{V}(\theta_{S}^{t-1}), \delta\theta^{t-1} \rangle = y_{t}.$$
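
The first-order approximation used in the two expressions above can be checked numerically on a toy problem. The sketch below uses a squared-error loss on a linear model, which is an illustrative stand-in and not the setting of the paper, and compares the actual loss difference after a small parameter update with the inner product of the gradient and the update.

```python
def loss(theta, x, y):
    """Squared error of a linear model on one sample."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return 0.5 * residual ** 2


def grad(theta, x, y):
    """Gradient of the loss above with respect to theta."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]


def first_order_gap(theta, step, x, y):
    """Return (actual loss difference, inner-product estimate) for one update."""
    new_theta = [t + s for t, s in zip(theta, step)]
    actual = loss(new_theta, x, y) - loss(theta, x, y)
    estimate = sum(g * s for g, s in zip(grad(theta, x, y), step))
    return actual, estimate


theta = [1.0, -2.0]
step = [0.01, 0.02]  # a small update, playing the role of delta-theta
actual, estimate = first_order_gap(theta, step, x=[0.5, 1.0], y=0.0)
print(actual, estimate, actual - estimate)
```

For this quadratic loss the gap is exactly the second-order term $\frac{1}{2}(x^{\top} \delta\theta)^{2}$, so it shrinks quadratically as the update gets smaller.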

$$\langle \nabla_{\theta} \ell(\theta_{S}^{t-1}, z_{m}), \delta\theta^{t-1} \rangle = c \, \langle G_{V}(\theta_{S}^{t-1}), \delta\theta^{t-1} \rangle + K'.$$

$$\langle \nabla_{\theta} \ell(\theta_{S}^{t-1}, z_{m}) - c \, G_{V}(\theta_{S}^{t-1}), \delta\theta^{t-1} \rangle = K'.$$

$$\langle \nabla_{\theta} \ell(\theta_{S}^{t-1}, z_{m}) - c \, G_{V}(\theta_{S}^{t-1}), \delta\theta^{t-1} \rangle = 0.$$

$$w_{t-1} = \nabla_{\theta} \ell(\theta_{S}^{t-1}, z_{m}) - c \, G_{V}(\theta_{S}^{t-1}) = 0.$$

$$\nabla_{\theta} \ell(\theta_{S}^{t-1}, z_{m}) = c \, G_{V}(\theta_{S}^{t-1}).$$

$$\cos(\gamma_{t-1}) = \cos\left( \angle\left( \nabla_{\theta} \ell(\theta_{S}^{t-1}, z_{m}), G_{V}(\theta_{S}^{t-1}) \right) \right) = 1.$$

$$\cos(\gamma_{t-1}) \geq 1 - \epsilon_{t-1}',$$

$$C = \{ z_{m} \in S : \mathtt{CLD}(z_{m}) \geq 1 - \epsilon \}, \quad \text{with } |C| = k, \ \epsilon > 0, \ \epsilon \to 0.$$

$$\cos\left( \angle\left( \nabla_{\theta} \ell(\theta_{C}^{t}, z_{m}), G_{V}(\theta_{C}^{t}) \right) \right) \geq 1 - \kappa,$$

$$\| \omega_{m} \|_{2}$$

$$\| \omega_{V} \|_{2}$$

Full text

Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

**Manish Nagaraj**, **Deepak Ravikumar**, **Kaushik Roy**

Electrical and Computer Engineering, Purdue University

Abstract

Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences ($\mathtt{CLD}$), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. $\mathtt{CLD}$ is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for $\mathtt{CLD}$-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, $\mathtt{CLD}$-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. $\mathtt{CLD}$ transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with $<1\%$ degradation. Moreover, $\mathtt{CLD}$ is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, $\mathtt{CLD}$ exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make $\mathtt{CLD}$ a principled, efficient, stable, and transferable tool for scalable dataset optimization. The code is available on GitHub.

1 Introduction

Deep learning models rely on large and diverse datasets to achieve state-of-the-art performance across a wide range of tasks. However, training on such datasets is increasingly constrained by compute and memory budgets, especially in real-time or resource-limited settings. This raises a fundamental question: Which subsets of data most effectively support generalization? A natural answer is offered by coresets: compact, representative subsets of training data that retain full-dataset performance when used for training.

Coresets support a range of applications including active learning (coreset_activelearning1), neural architecture search (coreset_nas1; coreset_nas2), dataset distillation (coreset_dc), and continual learning (coreset_cl_1; bilevelcoresets_borsos2020). However, most existing approaches either rely on heuristic criteria unrelated to generalization (forgetting_toneva2018; initial_beloudah2020; icarl_rebuffi2017) or require expensive second-order or bilevel optimization (gradmatch_killamsetty2021; tracin_pruthi2020; slocurves_garg2023; glister_killamsetty2021; refinedcoreset_xia2024; bilevelcoresets_borsos2020), which limits scalability.

We propose a simple, scalable, and theoretically grounded alternative for coreset generation using a metric we define as the Correlation of Loss Differences ($\mathtt{CLD}$). This metric quantifies how closely the loss trajectory of a training sample aligns with the average validation loss trajectory during training (Figure 1, left). Since the validation set reflects the test distribution, samples with high positive $\mathtt{CLD}$ are likely to contribute positively to generalization. Selecting such samples yields compact coresets that preserve, or even improve, test accuracy by filtering out ambiguous or harmful examples (Figure 1, right).

Beyond simplicity, $\mathtt{CLD}$ provides strong theoretical guarantees. We prove that training on high-$\mathtt{CLD}$ samples achieves convergence in population risk with an error bound that closely matches full-data training, where the excess error is explicitly governed by the sample alignment parameter $\kappa$ and the validation representativeness $\delta$ (see Theorem 1). Our theory reveals that high $\mathtt{CLD}$ is not just sufficient but also necessary to minimize the convergence error bound under coreset training.

We validate these findings across CIFAR-100 and ImageNet-1k, where $\mathtt{CLD}$-selected coresets typically outperform or closely match state-of-the-art methods across a wide range of coreset sizes, and remain within 1% of more computationally expensive baselines even when not leading. Unlike these methods, $\mathtt{CLD}$ requires only per-sample loss values, allowing it to avoid the costly gradients, Jacobians, or feature embeddings used by many influence- and similarity-based selectors. Concretely, the only computational overhead beyond a standard training run on the full dataset of size $N$ is a single forward pass at each checkpoint over a small, held-out validation set of size $Q$. Since the validation set is typically much smaller than the training set ($Q \ll N$), this lightweight approach yields significant gains in both computational and storage efficiency, which we quantify in our full cost analysis summarized in Table 1.

An additional strength of $\mathtt{CLD}$ is its robustness: the metric remains stable when computed using sparsely sampled training checkpoints and is consistent across random seeds, making it practical for large-scale or budgeted deployments. Furthermore, $\mathtt{CLD}$ coresets transfer effectively across architectures, which is one of the method's key advantages. Coresets computed using a small proxy model (e.g., ResNet-18) can be reused to train larger or different target models (e.g., ResNet-50, VGG, DenseNet) with performance drops consistently under $1\%$, while substantially reducing selection cost.

Summary of contributions:

1. Correlation of Loss Differences for Coreset Selection. We introduce $\mathtt{CLD}$, a simple and scalable metric for coreset construction based on the correlation between a training sample's loss differences and the average validation loss trajectory, serving as a proxy for generalization, in Section 4.

2. Theoretical Guarantees. We develop a general convergence framework showing that training on high-$\mathtt{CLD}$ samples yields population risk close to full-data training, with the suboptimality explicitly governed by sample alignment and validation representativeness, in Section 5.

3. Experimental Validation. We show that on CIFAR-100 and ImageNet-1k, $\mathtt{CLD}$-selected coresets typically outperform or closely match state-of-the-art methods across a wide range of subset sizes, and remain within 1% of more expensive baselines when not leading, in Section 6.

4. Efficiency, Transferability, and Stability. $\mathtt{CLD}$ avoids gradient and curvature computations, incurs minimal compute and storage cost, transfers across architectures via proxy models, and remains stable under checkpoint subsampling and random seeds, making it highly practical for large-scale settings. We discuss these in Section 7 and Section 8.

2 Related Literature

The need for scalable coreset methods has led to a variety of approaches, which can be broadly grouped into score-based, optimization-based, and training property-based methods.

Score-based methods select training samples according to predefined metrics, often in the feature space or based on model prediction confidence. Some methods score a sample by its distance to a class center (icarl_rebuffi2017; endtoend_castro2018; initial_beloudah2020), to the class feature median (e.g., $\mathtt{Moderate}$ (xia2022moderate)), or to other samples (sener2018active). $\mathtt{Cal}$ (cal_margatina2021) identifies contrastive examples via the KL divergence between predictive distributions, while $\mathtt{Herding}$ (herding_chen2010) selects representative samples using a kernel-based approach in Hilbert space. Other methods rely on prediction probabilities from a (possibly proxy) model. For example, $\mathtt{Forgetting}$ (forgetting_toneva2018) counts how often a sample is misclassified after previously being correct, $\mathtt{GraNd}$ and $\mathtt{EL2N}$ (E2LN_grand_paul2021) rank examples by their early-training loss gradient norm or $l_{2}$ error norm, respectively, and $\mathtt{AUM}$ (pleiss2020_aum) scores the area under the margin across training to flag potential label issues. While these approaches are simple and avoid expensive gradient or Hessian computation, they often inherit biases from their heuristic scoring rules and offer no convergence guarantees. To mitigate bias, methods such as $\mathtt{CCS}$ (zheng_ccs) employ stratified sampling for diversity, while $\mathbb{D}^{2}$ - $\mathtt{Pruning}$ (modelpred_d2_maharana2023) combines difficulty (prediction variance) and diversity (feature density) in a graph-based framework. Nonetheless, such strategies often make restrictive assumptions to avoid noisy samples (e.g., $\mathtt{CCS}$ discards up to $30\%$ of data) and still lack theoretical generalization guarantees.

Optimization-based methods formulate coreset selection as an explicit optimization problem, often with provable convergence guarantees. $\mathtt{CRAIG}$ (craig_mirzasoleiman2020) and $\mathtt{GradMatch}$ (gradmatch_killamsetty2021) select samples that align with the full-data gradient direction, while $\mathtt{Glister}$ (glister_killamsetty2021) maximizes held-out validation log-likelihood. Bilevel optimization (bilevelcoresets_borsos2020) has been used to leverage influence functions (Koh2017) for selecting samples with maximal generalization benefit, and $\mathtt{GraphCut}$ (graphcut_iyer2021) uses submodular information measures as the objective. Recently, $\mathtt{BoundarySet}$ - $\mathtt{CCS}$ (mindboundary_yang2024) minimized decision boundary reconstruction error while ensuring class diversity. While theoretically grounded, these approaches often require repeated optimization loops, making them computationally expensive for large datasets.

Training property-based methods exploit the dynamics of training to assess sample importance. $\mathtt{SloCurv}$ (slocurves_garg2023) uses second-order loss curvature statistics to identify samples with better generalization potential, and $\mathtt{TracIn}$ (tracin_pruthi2020) tracks gradient alignment with a validation set. $\mathtt{TDDS}$ (zhang2024_tdds) extends this idea by projecting each sample’s gradient onto the accumulated gradient to quantify its true contribution, and by monitoring this projection across multiple iterations to account for fluctuations in importance over time. Although effective, such methods depend on costly first- or second-order statistics, limiting scalability.

Scalable training property-based methods reduce this overhead by measuring per-sample dynamics using only forward-pass signals. $\mathtt{Dyn}$ - $\mathtt{Unc}$ (uncertainity_he2024_dynunc) summarizes the variability of the true-class probability over sliding windows and prunes by thresholding aggregated uncertainty after training. $\mathtt{DUAL}$ (cho2025_DUAL) combines this uncertainty score with a difficulty term and uses a pruning-ratio–adaptive Beta sampling schedule to reweight selection at high pruning ratios. In contrast, our method $\mathtt{CLD}$ ranks examples by how well their loss-difference trajectories align with the class-wise validation loss trajectory, providing an explicit generalization-alignment criterion. All three approaches admit identical minimal logging, one scalar per example per checkpoint (probability for $\mathtt{Dyn}$ - $\mathtt{Unc}$ / $\mathtt{DUAL}$ ; loss for $\mathtt{CLD}$ ), so compute and storage are directly comparable. Unlike uncertainty-only methods, however, $\mathtt{CLD}$ ’s alignment objective supports a convergence guarantee (Section 5), and it avoids ratio-specific sampling schedules and their hyperparameters; we report robustness at extreme cases (e.g., coresets of size $5\%$ ) under identical logging (Section 6). In short, $\mathtt{CLD}$ combines the practicality of score-based metrics (low compute/storage via scalar logging) with a validation-aligned training-dynamics signal that admits a convergence guarantee.

3 Preliminaries and Problem Setup

We consider the supervised problem of learning a mapping from the input space to the output space, ${\bm{\mathsfit{X}}}\to{\bm{\mathsfit{Y}}}$ , where ${\bm{\mathsfit{X}}}\subseteq\mathbb{R}^{d}$ and ${\bm{\mathsfit{Y}}}\subseteq\mathbb{R}$ . The training dataset ${\mathbf{S}}$ consists of $N$ samples drawn from an unknown distribution ${\mathbf{D}}$ over ${\bm{\mathsfit{X}}}\times{\bm{\mathsfit{Y}}}$ , with each sample denoted as $\vec{z}_{i}=(\vec{x}_{i},y_{i})$ . Thus, ${\mathbf{S}}=\{\vec{z}_{1},\dots,\vec{z}_{N}\}$ . Additionally, we assume access to a query (held-out validation) set ${\mathbf{V}}=\{\vec{q}_{1},\dots,\vec{q}_{Q}\}\sim{\mathbf{D}}^{Q}$ of $Q$ samples, which serves as a proxy for the true distribution ${\mathbf{D}}$ .

A learning algorithm ${\mathbf{A}}$ (e.g., SGD) is used to train a model with parameters $\theta\in\mathbb{R}^{p}$ on the training set ${\mathbf{S}}$ over $T$ iterations. We denote the model parameters at iteration $t$ of training on ${\mathbf{S}}$ as $\theta_{{\mathbf{S}}}^{t}$ , where $\theta_{{\mathbf{S}}}^{0}$ corresponds to the random initialization prior to the first update. The performance of the model at step $t$ on a sample $\vec{z}_{m}$ is evaluated using a loss function $\ell:\mathbb{R}^{p}\times({\bm{\mathsfit{X}}}\times{\bm{\mathsfit{Y}}})\to\mathbb{R}$ , whose value $\ell(\theta_{{\mathbf{S}}}^{t},\vec{z}_{m})$ quantifies the prediction error on $\vec{z}_{m}$ at that point in training.

The goal of training is to minimize the population risk $R_{{\mathbf{D}}}(\theta)$ ,

$$R_{{\mathbf{D}}}(\theta)=\mathbb{E}_{\vec{z}\sim{\mathbf{D}}}\left[\ell(\theta,\vec{z})\right].$$

However, since ${\mathbf{D}}$ is unknown, we instead minimize the empirical risk $\hat{R}(\theta,{\mathbf{S}})$ ,

$$\hat{R}(\theta,{\mathbf{S}})=\frac{1}{N}\sum_{i=1}^{N}\ell(\theta,\vec{z}_{i}).$$

The gradient of the loss with respect to the parameters $\theta$ at step $t$ for a sample $\vec{z}_{i}$ is denoted as $\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t},\vec{z}_{i})$ . The average gradient over the validation set ( $G_{{\mathbf{V}}}$ ) is,

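$$G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t})=\frac{1}{Q}\sum_{j=1}^{Q}\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t},\vec{q}_{j}).$$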

4 Correlation of Loss Differences (CLD)

We now define the core quantity used in our method, the correlation of per-sample loss trajectories with the validation set.

Loss Trajectories

For every sample $\vec{z}$ we record the per‑iteration change in loss during the model training run, and collect these $T$ increments in a loss‑difference trajectory

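$$\vec{\Delta}(\vec{z})=\Big[\ell(\theta_{{\mathbf{S}}}^{1},\vec{z})-\ell(\theta_{{\mathbf{S}}}^{0},\vec{z}),\;\dots,\;\ell(\theta_{{\mathbf{S}}}^{T},\vec{z})-\ell(\theta_{{\mathbf{S}}}^{T-1},\vec{z})\Big]\in\mathbb{R}^{T},$$

where we write $\vec{\Delta}_{m}\coloneqq\vec{\Delta}(\vec{z}_{m})$ for a training sample $\vec{z}_{m}$ .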

The validation‑average trajectory is defined similarly as,

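$$\vec{\Delta}^{\prime}_{{\mathbf{V}}}=\frac{1}{Q}\sum_{j=1}^{Q}\vec{\Delta}(\vec{q}_{j}),$$

where $\vec{\Delta}(\vec{q}_{j})$ is the loss-difference trajectory of validation sample $\vec{q}_{j}$ .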

Definition 1 (Correlation of Loss Differences ( $\mathtt{CLD}$ )).

The $\mathtt{CLD}$ score of a training sample $\vec{z}_{m}\in{\mathbf{S}}$ is the correlation between the sample’s loss trajectory $\vec{\Delta}_{m}$ and the average loss trajectory of the validation set ${\mathbf{V}}$ :

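$$\mathtt{CLD}(\vec{z}_{m})\coloneqq\rho\big(\vec{\Delta}_{m},\,\vec{\Delta}^{\prime}_{{\mathbf{V}}}\big),$$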

where $\rho$ is the correlation metric. In our experiments, we employ Pearson correlation (pearsoncorr) due to its scale invariance and computational simplicity. Intuitively, the $\mathtt{CLD}$ metric quantifies how well a training sample’s loss dynamics align with the aggregate loss trajectory of the validation set, which serves as a proxy for generalization behavior. Samples with higher $\mathtt{CLD}$ values are deemed more influential and can thus be prioritized for coreset construction. We investigate this hypothesis both theoretically and empirically in the following sections.

While Definition 1 defines $\mathtt{CLD}$ using a global validation trajectory, our practical implementation uses class-specific averages to ensure semantic alignment; see Section 4.1.

4.1 Coreset Selection Procedure

We first train a source model $\theta_{{\mathbf{S}}}$ on the full dataset ${\mathbf{S}}$ , recording per-epoch losses for all training and validation samples. This logging piggybacks on the standard training loop; no additional passes over the training set are required. Consequently, all training samples are scored at the chosen checkpoints. We then compute $\vec{\Delta}_{m}$ and the class-specific average validation trajectory $\vec{\Delta}^{\prime}_{{\mathbf{V}},c}$ for each class $c$ .

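$$\vec{\Delta}^{\prime}_{{\mathbf{V}},c}=\frac{1}{Q_{c}}\sum_{j:\,\vec{q}_{j}\text{ has label }c}\vec{\Delta}(\vec{q}_{j}),$$

where $Q_{c}$ is the number of validation samples with label $c$ and $\vec{\Delta}(\vec{q}_{j})$ is the loss-difference trajectory of validation sample $\vec{q}_{j}$ .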

In accordance with standard practice for coreset selection, we score samples within each class independently. For a training sample $\vec{z}_{m}\in{\mathbf{S}}$ with label $y_{m}=c$ , its $\mathtt{CLD}$ score is the Pearson correlation between its trajectory and the corresponding class-specific validation trajectory:

$$\mathtt{CLD}(\vec{z}_{m})=\rho\big(\vec{\Delta}_{m},\,\vec{\Delta}^{\prime}_{{\mathbf{V}},c}\big).$$

After computing all scores, we select the top- $k_{c}$ training samples in each class $c$ to form a class-balanced coreset

[TABLE]

with total size fixed in advance as $k=\sum_{c=1}^{C}k_{c}$ . This per-class selection strategy ensures both label balance and stability of dynamics within semantic categories, improving the robustness and interpretability of the resulting coreset.
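The selection procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the reference implementation: the function name `select_cld_coreset`, the array layout (one row per sample, one column per logged checkpoint, with the initialization as the first column), and the uniform per-class budget `k_per_class` are assumptions made here for concreteness, and every class is assumed to have at least one validation sample.

```python
import numpy as np

def select_cld_coreset(train_losses, train_labels, val_losses, val_labels, k_per_class):
    """Class-wise CLD selection (hypothetical helper, not the authors' code).

    train_losses: (N, T+1) per-sample training losses at checkpoints 0..T
    val_losses:   (Q, T+1) per-sample validation losses at the same checkpoints
    Returns the indices of the selected training samples.
    """
    train_labels = np.asarray(train_labels)
    val_labels = np.asarray(val_labels)
    # Loss-difference trajectories: successive differences along the checkpoint axis.
    d_train = np.diff(np.asarray(train_losses, dtype=np.float64), axis=1)
    d_val = np.diff(np.asarray(val_losses, dtype=np.float64), axis=1)

    selected = []
    for c in np.unique(train_labels):
        idx = np.flatnonzero(train_labels == c)
        # Class-specific average validation trajectory.
        ref = d_val[val_labels == c].mean(axis=0)
        # Pearson correlation of each class-c training trajectory with the reference.
        ref_c = ref - ref.mean()
        rows_c = d_train[idx] - d_train[idx].mean(axis=1, keepdims=True)
        denom = np.linalg.norm(rows_c, axis=1) * np.linalg.norm(ref_c)
        scores = (rows_c @ ref_c) / np.maximum(denom, 1e-12)
        # Keep the k_per_class highest-scoring samples of this class.
        order = np.argsort(-scores, kind="stable")
        selected.append(idx[order[:k_per_class]])
    return np.concatenate(selected)

# Toy data: class 0 validation loss drops early, class 1 validation loss drops late.
val_losses = np.array([[3.0, 2.0, 1.5, 1.4, 1.4],
                       [3.0, 2.9, 2.7, 2.0, 1.0]])
val_labels = np.array([0, 1])
train_losses = np.array([[3.0, 2.0, 1.5, 1.4, 1.4],   # class 0, tracks class-0 validation
                         [3.0, 2.9, 2.7, 2.0, 1.0],   # class 0, drops late (misaligned)
                         [3.0, 2.2, 1.6, 1.3, 1.2],   # class 0, roughly aligned
                         [3.0, 2.9, 2.7, 2.0, 1.0],   # class 1, tracks class-1 validation
                         [3.0, 2.0, 1.5, 1.4, 1.4],   # class 1, drops early (misaligned)
                         [3.0, 2.8, 2.5, 1.9, 1.2]])  # class 1, roughly aligned
train_labels = np.array([0, 0, 0, 1, 1, 1])
coreset = select_cld_coreset(train_losses, train_labels, val_losses, val_labels, k_per_class=2)
```

In this toy example the two misaligned samples (indices 1 and 4) are the ones left out. With unequal budgets $k_{c}$, the scalar `k_per_class` would be replaced by a per-class lookup.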

A key advantage is architectural flexibility. $\mathtt{CLD}$ scores can be computed using a proxy model and transferred to larger or deeper architectures. The full coreset selection procedure is summarized in Appendix A.

5 Theoretical Analysis of CLD-Coresets

We now provide a theoretical justification for selecting high- $\mathtt{CLD}$ samples, showing that such coresets yield convergence guarantees close to full-data training under the following assumptions.

Assumption 1 ( $L$ -smoothness).

For every fixed sample $\vec{z}$ , define $f(\theta)\coloneqq\ell(\theta,\vec{z})$ . Then $f$ is $L$ -smooth in $\theta$ , i.e.,

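$$\left\lVert\nabla f(\theta_{1})-\nabla f(\theta_{2})\right\rVert_{2}\leq L\left\lVert\theta_{1}-\theta_{2}\right\rVert_{2}\quad\text{for all }\theta_{1},\theta_{2}\in\mathbb{R}^{p}.$$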

Consequently, both the population risk $R_{{\mathbf{D}}}(\cdot)$ and the empirical risk $\hat{R}(\cdot)$ are also $L$ -smooth.

Assumption 2 (Bounded Gradient Norm).

There exists $B>0$ such that, for all $\theta$ and every training sample $\vec{z}_{m}\in{\mathbf{S}}$ and validation sample $\vec{q}_{j}\in{\mathbf{V}}$ , $\left\lVert\nabla_{\theta}\ell(\theta,\vec{z}_{m})\right\rVert_{2}\leq B,$ and $\left\lVert\nabla_{\theta}\ell(\theta,\vec{q}_{j})\right\rVert_{2}\leq B.$ Consequently, for any index set $\mathcal{C}\subseteq\{1,\ldots,N\}$ ,

[TABLE]

where $\displaystyle G_{{\mathbf{V}}}(\theta)\coloneqq\frac{1}{Q}\sum_{j=1}^{Q}\nabla_{\theta}\ell(\theta,\vec{q}_{j})$ is the validation-average gradient.

Assumption 3 (Validation Representativeness).

With probability at least $1-\delta^{\prime}$ , the validation gradient $G_{{\mathbf{V}}}(\cdot)$ at every iterate $\theta$ encountered during training satisfies

[TABLE]

These assumptions mirror those commonly adopted in analyses of training dynamics (tracin_pruthi2020; datamodels_ilyas2022; 2020_datarepresentativeness_validassump; 2022_datarepresentivity_validassump; 2022_deepactive_validassump); we merely state Assumption 3 explicitly for transparency, even though it is typically invoked implicitly and is widely regarded as reasonable.

Remark 1 (Per-Class Validation Trajectories).

In our implementation, we compute $\mathtt{CLD}$ using class-specific validation trajectories $\vec{\Delta}^{\prime}_{{\mathbf{V}},c}$ rather than a single global trajectory. This refinement is consistent with the standard coreset practice of enforcing class balance, and it reduces the variance of the correlation estimates by matching each training sample with the validation subset most relevant to its semantic label. The theoretical guarantees stated here continue to hold as long as the per-class validation subsets satisfy the representativeness condition in Assumption 3 when interpreted class-conditionally.

Theorem 1 (Convergence with $\mathtt{CLD}$ -Coresets).

Consider gradient descent run for $T$ iterations on a training dataset ${\mathbf{S}}$ with a held-out validation set ${\mathbf{V}}$ . Suppose Assumptions 1, 2, and 3 hold, and let the learning rate satisfy $0<\eta\leq 1/L$ . Let $\theta_{{\mathbf{C}}}^{t}$ denote the parameters at iteration $t$ when training on a coreset ${\mathbf{C}}$ .

Then, training on the coreset ${\mathbf{C}}$ , consisting of samples with high $\mathtt{CLD}$ scores:

[TABLE]

guarantees that

[TABLE]

where $R_{\inf}:=\inf_{\theta}R_{{\mathbf{D}}}(\theta)$ , and $\kappa\geq 0$ is an alignment-gap term that quantifies the mismatch between the average coreset gradient and the validation proxy gradient $G_{{\mathbf{V}}}(\theta)$ along training (see Appendix B for the formal definition). Intuitively, $\kappa$ decreases as the selected samples’ $\mathtt{CLD}$ scores increase and as the coreset grows, and $\kappa\to 0$ as $\epsilon\to 0$ .

Proof Sketch.

The proof is based on the observation that the change in population risk across training steps can be approximated by the inner product between the gradient of the risk and the update direction. A high $\mathtt{CLD}$ score implies a strong correlation between a sample’s loss-change trajectory and that of the validation set, which in turn suggests consistent alignment between the sample’s gradient and the validation gradient.

Due to the $L$ -smoothness of the loss function, this alignment persists even when training on the coreset, allowing us to bound the cosine similarity between the average coreset gradient and the true risk gradient. This leads to a controlled approximation error in the optimization update. The total error term $(B\sqrt{2\kappa}+\delta)^{2}$ is governed by three factors: the $\mathtt{CLD}$ scores of selected samples, the deviation between the coreset and full-data parameter trajectories, and the quality of the validation set as a proxy for the true distribution. Full details and supporting lemmas are provided in Appendix B. ∎

Interpreting the Theory

Under the stated assumptions, training on the full dataset ${\mathbf{S}}$ yields the convergence bound

[TABLE]

Theorem 1 shows that training on a high- $\mathtt{CLD}$ coreset achieves a similar bound, up to an additive deviation term $(B\sqrt{2\kappa}+\delta)^{2}$ . This deviation captures the alignment of the coreset with the validation dynamics ( $\kappa$ ), and the representativeness of the validation set ( $\delta$ ), with $\delta=O(B/\sqrt{Q})$ decreasing in the validation size $Q$ , and $\kappa$ decreasing as the CLD selection is tightened (smaller $\epsilon$ ) or as the coreset size $k$ increases (see Remark 3 and Remark 4 in Appendix B).

The alignment term $\kappa$ reflects both the informativeness of selected samples and the size of the coreset. Higher $\mathtt{CLD}$ scores indicate stronger agreement with validation loss trajectories and thus tighter gradient alignment, reducing $\kappa$ . Additionally, larger coresets more faithfully approximate full-data training dynamics, also lowering $\kappa$ . We note that making $k$ extremely small can increase trajectory deviation $\|\delta_{t}\|$ , which is reflected inside $\kappa$ (Remark 4). When the coreset size is fixed, the theorem implies that selecting higher- $\mathtt{CLD}$ samples improves convergence by minimizing this deviation. Thus, $\mathtt{CLD}$ -based selection emerges as a principled and necessary criterion for preserving the optimization behavior of full-data training.

Corollary 1 (Necessity of High $\mathtt{CLD}$ for Good Coresets).

Under the hypotheses of Theorem 1, achieving convergence rates comparable to full-data training necessarily requires that the selected samples exhibit near-maximal $\mathtt{CLD}$ scores and that the validation set provides a reliable proxy for the true risk gradient. Fulfilling these necessary conditions ensures the optimization dynamics induced by the coreset remain well-aligned with those of full-data training.

6 Experimental Evaluation

We evaluate $\mathtt{CLD}$ empirically, focusing on its effectiveness and transferability.

Experimental Setup

We benchmark on CIFAR-100 (cifar) and ImageNet-1k (imagenet). CIFAR-100 has $50{,}000$ training and $10{,}000$ test images across $100$ classes; ImageNet-1k has $\sim\!1.28$ M training images and a $50{,}000$ -image validation set across $1{,}000$ classes. For each random seed, we form a classwise held-out validation split from the training data ( $10\%$ for CIFAR-100; $1\%$ for ImageNet-1k), ensuring equal per-class representation; a different split is generated per seed, and the resulting train/validation partitions are reused across all baselines for fairness. Unless otherwise specified, ResNet-18 (resnet) is the default architecture for $\mathtt{CLD}$ scoring and for training on selected coresets. Coresets are constructed per seed in a class-balanced manner by selecting, within each class, the top-ranked samples under $\mathtt{CLD}$ . Subset sizes range from $0.2\%$ to $100\%$ on CIFAR-100 and from $0.1\%$ to $100\%$ on ImageNet-1k. We report the mean and standard deviation over $5$ independent seeds.
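A classwise held-out split of the kind described above could be formed as in the following sketch. The helper name `classwise_holdout_split` and the rounding rule are assumptions made for illustration; the text only specifies the held-out fraction, equal per-class representation, and a fresh split per seed.

```python
import numpy as np

def classwise_holdout_split(labels, val_fraction, seed):
    """Hold out the same fraction of every class (hypothetical helper)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)  # a different seed yields a different split
    val_parts = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_val = int(round(val_fraction * idx.size))
        val_parts.append(rng.choice(idx, size=n_val, replace=False))
    val_idx = np.sort(np.concatenate(val_parts))
    train_mask = np.ones(labels.size, dtype=bool)
    train_mask[val_idx] = False
    return np.flatnonzero(train_mask), val_idx

# CIFAR-100-shaped labels: 100 classes with 500 training images each.
labels = np.repeat(np.arange(100), 500)
train_idx, val_idx = classwise_holdout_split(labels, val_fraction=0.10, seed=0)
per_class_val = np.bincount(labels[val_idx], minlength=100)
```

Reusing the returned index arrays for every baseline keeps the train/validation partition identical across methods for a given seed.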

Baselines. We compare against representative state-of-the-art methods from three families: score-based ( $\mathtt{Forgetting}$ (forgetting_toneva2018), $\mathtt{EL2N}$ (E2LN_grand_paul2021), and $\mathtt{CCS}$ (zheng_ccs) using $\mathtt{AUM}$ (pleiss2020_aum)), optimization-based ( $\mathtt{Glister}$ (glister_killamsetty2021), $\mathbb{D}^{2}$ - $\mathtt{Pruning}$ (modelpred_d2_maharana2023)), and training-property–based ( $\mathtt{TDDS}$ (zhang2024_tdds), $\mathtt{SloCurv}$ (slocurves_garg2023), $\mathtt{DUAL}$ (cho2025_DUAL)), plus $\mathtt{Random}$ . We use implementations from the DeepCore (deepcore) library when available, and otherwise rely on official GitHub repositories. All methods are run under a consistent training setup (40 pretraining epochs where required), without any additional fine-tuning or regularization. To ensure fairness, all baselines, including ours, select and train coresets using the same backbone (ResNet-18).

Transferability protocol. On ImageNet-1k we additionally test cross-architecture transfer. We compute Transfer coresets using ResNet-18 and apply them to ResNet-34, ResNet-50, VGG-19, and DenseNet-121. We compare this to an Oracle setting where each target model computes its own $\mathtt{CLD}$ scores and coresets from its dynamics.

Results and Observations

Figure 2 summarizes performance on CIFAR-100 and ImageNet-1k compared to other methods. Across both datasets, $\mathtt{CLD}$ consistently matches or outperforms the strongest baselines from each family. The most competitive alternatives are $\mathbb{D}^{2}$ - $\mathtt{Pruning}$ and $\mathtt{DUAL}$ , though both degrade at very small coreset sizes. On CIFAR-100, $\mathtt{Glister}$ , $\mathtt{DUAL}$ , and $\mathtt{CCS}$ ( $\mathtt{AUM}$ ) can slightly edge out $\mathtt{CLD}$ at larger subsets (by $<\!1\%$ ), whereas $\mathtt{CLD}$ consistently leads on ImageNet-1k. At large subset sizes, $\mathtt{CLD}$ converges to full-data performance with negligible deviation from the strongest baseline. Full numerical tables (and additional methods beyond those plotted) are deferred to Appendix C to avoid clutter. A complementary analysis of the subset fraction required to match full-data accuracy is presented in Section D.2.

For cross-architecture transfer on ImageNet-1k, Figure 3 shows that Transfer coresets selected with ResNet-18 closely track Oracle coresets computed by the target model itself. The gap remains below $1\%$ across ResNet-34, ResNet-50, VGG-19, and DenseNet-121 and across coreset sizes, including transfers across architecture families (ResNet $\rightarrow$ DenseNet/VGG).

Takeaways

$\mathtt{CLD}$ achieves near-optimal accuracy across subset sizes while incurring the lowest compute and storage overhead among strong baselines, and its coresets transfer effectively from lightweight proxies to larger targets. Together, these properties make $\mathtt{CLD}$ a scalable, reliable choice for coreset selection in both single-architecture and cross-architecture regimes.

7 Computational and Storage Efficiency

Beyond accuracy, a practical coreset method should keep both compute and storage costs low. CLD does so by relying only on per-sample loss scalars that standard training already produces; no per-sample gradients, Hessians, or pairwise similarities are required. In practice, the full training set is scored at a small number of checkpoints, but the logged quantities are exactly the per-sample losses already computed during training, so no extra inference sweeps over the training set are needed. The only extra work is a forward-only sweep over a small held-out query set each proxy epoch ( $Q\ll N$ ) to record query losses. In contrast, gradient/adversarial methods incur extra backward passes, while similarity/nearest-neighbor methods require feature extraction and large feature caches. We quantify the end-to-end compute cost (selection plus training on the selected coreset) and the storage overhead in detail in Appendix E and summarize the symbolic complexity below.

Notation and setup.

We measure compute in floating-point operations (FLOPs) and report storage overheads:

• Data and epochs. $N$ training samples, $Q$ query samples; $T$ epochs for the large model, $T_{\text{proxy}}$ for the proxy; $T_{\text{early}}$ (early scoring), $T_{\text{proxy,early}}$ (early proxy epochs in DUAL).
• Model cost convention. Large model forward cost $f_{\text{large}}$ , proxy forward cost $f$ with $f\ll f_{\text{large}}$ . One backward $\approx 2$ forwards $\Rightarrow$ one training step $\approx 3$ forwards per example.
• Subset/problem. $k$ coreset size; $d$ input dimension; $c$ classes; $R$ repeats (restarts/probes); $\gamma$ reselection interval.
• CRAIG embeddings. $F$ penultimate-feature dimension; $D_{\text{eff}}\!\coloneqq\!F{+}c$ is the embedding size used by CRAIG.
• Method-specific. $J$ window length (Dyn-Unc/DUAL/TDDS); $H$ message-passing rounds ( $\mathbb{D}^{2}$ -Pruning); $\kappa$ $k$NN degree; $U$ unlabeled-pool size (Cal); $\gamma_{\text{anc}}$ anchor spacing and $A{=}T_{\text{proxy}}/\gamma_{\text{anc}}$ anchors (CRAIG); $\epsilon$ stochastic-greedy tolerance; $\lambda$ trade-off in GraphCut.

Results and observations (compute).

As summarized in Table 1, methods that score during early training of the large model (e.g., Forgetting, EL2N, GraNd) require one or more full sweeps over all $N$ examples with the large network for $T_{\text{early}}$ epochs (and sometimes $R$ repeats), so their selection cost includes terms like $3NT_{\text{early}}R\,f_{\text{large}}$ , making them compute-inefficient even if the coreset used later is small. Optimization-with-reselection methods (e.g., Glister) add frequent subset updates every $\gamma$ epochs, contributing $\mathcal{O}\big((kQ+N\log(1/\epsilon))\,f_{\text{large}}\,T/\gamma\big)$ on top of $3kT\,f_{\text{large}}$ . Feature/graph–based selectors (Herding, Moderate, $\mathbb{D}^{2}$ -Pruning, Cal) pay $3NT_{\text{proxy}}f$ plus at least one $Nf$ encoding pass (sometimes graph/ $k$ NN work). By contrast, CLD uses only proxy training and cheap per-epoch query forwards:

[TABLE]

with no gradient/Hessian sweeps, no adversarial steps, and no pairwise similarities.
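To make the symbolic costs concrete, the sketch below evaluates them under the stated convention that one training step costs about three forwards per example. The early-scoring term $3NT_{\text{early}}R\,f_{\text{large}}$ and the coreset-training term $3kT\,f_{\text{large}}$ are taken from the text; the CLD selection expression is an assumption read off the prose (proxy training plus one forward-only query sweep per proxy epoch), and all numeric values are hypothetical placeholders in arbitrary units.

```python
def early_scoring_flops(N, T_early, R, f_large):
    """Selection cost when scoring during early training of the large model."""
    return 3 * N * T_early * R * f_large

def cld_selection_flops(N, Q, T_proxy, f):
    """ASSUMED form: proxy training plus a forward-only query sweep each proxy epoch."""
    return 3 * N * T_proxy * f + Q * T_proxy * f

def coreset_training_flops(k, T, f_large):
    """Cost of training the large model on the selected coreset."""
    return 3 * k * T * f_large

# Hypothetical sizes and relative forward costs (not measurements).
N, Q, k = 1_000_000, 10_000, 100_000
T, T_proxy, T_early, R = 90, 90, 10, 1
f, f_large = 1.0, 2.0

cld_selection = cld_selection_flops(N, Q, T_proxy, f)
cld_end_to_end = cld_selection + coreset_training_flops(k, T, f_large)
# Share of CLD's selection cost that comes from the extra query forwards.
query_share = (Q * T_proxy * f) / cld_selection
print(f"query forwards account for {100 * query_share:.2f}% of CLD selection compute")
```

Under this assumed form, the query sweep adds a fraction $Q/(3N+Q)$ to the proxy-training cost, which is small whenever $Q\ll N$.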

Results and observations (storage).

To make storage comparisons transparent, Table 1 reports selection-stage storage overhead only, i.e., method-specific extras beyond storing the large model’s weights. Early-training methods (Forgetting, EL2N, GraNd, AUM) need only $\mathcal{O}(N)$ scalars; windowed-uncertainty methods (Dyn-Unc, DUAL, TDDS) add $\mathcal{O}(NJ)$ logs. CRAIG stores $\mathcal{O}\!\big(N(F{+}c)\big)$ embeddings, similarity/feature methods cache $\mathcal{O}(Nd)$ (plus $\mathcal{O}(N\kappa)$ graphs), and GraphCut is $\mathcal{O}(N^{2})$ . CLD uses only scalar loss logs $\mathcal{O}\!\big((N{+}Q)T_{\text{proxy}}\big)$ .
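As a rough illustration of these orders of magnitude, the sketch below converts the scalar counts $(N{+}Q)T_{\text{proxy}}$ and $Nd$ into bytes. The 4-byte (float32) storage format, the function names, and the example sizes are assumptions for illustration only; constants hidden by the $\mathcal{O}(\cdot)$ notation are ignored.

```python
def cld_log_bytes(N, Q, T_proxy, bytes_per_scalar=4):
    """Scalar loss logs: one value per (training or query) sample per proxy epoch."""
    return (N + Q) * T_proxy * bytes_per_scalar

def feature_cache_bytes(N, d, bytes_per_scalar=4):
    """Feature cache: one d-dimensional vector per training sample."""
    return N * d * bytes_per_scalar

# Hypothetical ImageNet-scale sizes (placeholders, not values reported in the paper).
N, Q, T_proxy = 1_270_000, 12_800, 90
d = 3 * 224 * 224  # input dimension of a 224x224 RGB image

cld_gb = cld_log_bytes(N, Q, T_proxy) / 1e9
cache_gb = feature_cache_bytes(N, d) / 1e9
print(f"CLD loss logs: {cld_gb:.2f} GB, O(Nd) cache: {cache_gb:.0f} GB")
```

The ratio between the two is roughly $d/T_{\text{proxy}}$, which is why scalar logging stays cheap even when every epoch is retained.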

A visual summary.

Figure 4 summarizes the trade-off between accuracy (y-axis) and end-to-end compute (x-axis, log scale), with bubble size proportional to the selection-stage storage overhead. Points in the upper-left with small bubbles are closest to the “Pareto-efficient” frontier, combining high accuracy with low compute and storage cost. To provide a concrete, quantitative context for these trade-offs, the plot is generated using an illustrative setup: selecting $10\%$ coresets of ImageNet-1k with a ResNet-18 proxy, then training a ResNet-50 on the chosen coreset (see Appendix E for details). In this setting, methods that score during early training of the large model (e.g., GraNd, EL2N) and those with frequent reselection (e.g., Glister) appear far to the right due to large compute costs, while feature/similarity-based selectors (Herding, Moderate, Cal, $\mathbb{D}^{2}$ -Pruning) have large bubbles from $\mathcal{O}(Nd)$ feature caches. CLD lies near the efficient frontier: its selection cost is proxy-only plus lightweight query forwards, and its storage is just scalar loss logs. We also show $\mathtt{CLD}_{90}$ (scores derived using loss values from all proxy epochs) and $\mathtt{CLD}_{45}$ (first 45 epochs only); the latter cuts selection compute nearly in half while preserving accuracy (discussed in Section 8). This mirrors observations for DUAL, which also achieves minimal accuracy drop by using only early proxy epochs, underscoring that temporal truncation can further improve efficiency without sacrificing performance. For clarity, the figure uses shades of blue for score-based methods, shades of orange for optimization-based methods, shades of green for training-property-based methods, and dark red for $\mathtt{CLD}$ .

8 Discussion

We discuss practical considerations and empirical findings that further illustrate the applicability of $\mathtt{CLD}$ .

Stability under temporal subsampling.

We first assess robustness to reduced temporal resolution on ImageNet-1k with ResNet-18. $\mathtt{CLD}$ is computed either from the first 30 or 45 checkpoints (out of 90) or from trajectories subsampled at $2\times$ or $3\times$ lower frequency, while training still runs for all 90 epochs. As shown in Figure 5(a), using only early checkpoints (30/45) yields accuracy nearly identical to the full-trajectory setting, indicating that the informative signal is captured early in training. $2\times$ subsampling preserves accuracy, and even $3\times$ subsampling incurs only a minor degradation due to reduced temporal resolution.
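The truncation and subsampling variants can be expressed as different choices of retained checkpoint indices, as in the sketch below. It assumes that loss differences are taken between consecutive retained checkpoints and that checkpoint 0 is the initialization; the text does not state whether subsampling is applied before or after differencing, so this is one possible reading. The helper `cld_from_checkpoints` and the synthetic loss curves are hypothetical.

```python
import numpy as np

def cld_from_checkpoints(train_losses, val_losses, checkpoints):
    """Global CLD scores using only the given checkpoint columns (hypothetical helper)."""
    cp = np.asarray(checkpoints)
    # Differences between consecutive RETAINED checkpoints (assumption).
    d_train = np.diff(np.asarray(train_losses, dtype=np.float64)[:, cp], axis=1)
    d_val = np.diff(np.asarray(val_losses, dtype=np.float64)[:, cp], axis=1).mean(axis=0)
    # Pearson correlation of every training trajectory with the validation average.
    return np.array([np.corrcoef(row, d_val)[0, 1] for row in d_train])

T = 90
all_cp = np.arange(T + 1)   # 0 = initialization, 1..90 = end of each epoch
first_45 = all_cp[:46]      # yields 45 loss differences
every_2nd = all_cp[::2]     # 2x lower temporal resolution
every_3rd = all_cp[::3]     # 3x lower temporal resolution

# Synthetic loss curves: a shared exponential decay plus per-sample noise.
rng = np.random.default_rng(0)
base = 4.0 * np.exp(-all_cp / 20.0)
val_losses = base + 0.05 * rng.normal(size=(20, T + 1))
train_losses = base + 0.05 * rng.normal(size=(50, T + 1))

scores_full = cld_from_checkpoints(train_losses, val_losses, all_cp)
scores_45 = cld_from_checkpoints(train_losses, val_losses, first_45)
scores_2x = cld_from_checkpoints(train_losses, val_losses, every_2nd)
```

Comparing the rankings induced by `scores_full`, `scores_45`, and `scores_2x` (for example via rank correlation or top-$k$ overlap) is one way to check stability on real loss logs.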

Bias reduction and stratified sampling.

Several methods incorporate bias-reduction mechanisms, such as $\mathtt{CCS}$ (zheng_ccs), which stratifies selection across score percentiles to promote diversity. In contrast, $\mathtt{CLD}$ leverages per-class validation trajectories, yielding a generalization-aware signal that naturally balances classes and downweights noisy or redundant samples. As shown in Figure 5(b), applying $\mathtt{CCS}$ -style stratified sampling on top of $\mathtt{CLD}$ scores consistently reduces accuracy across coreset sizes on CIFAR-100. This contrasts with metrics like $\mathtt{AUM}$ , where $\mathtt{CCS}$ can be beneficial; for $\mathtt{CLD}$ , percentile quotas perturb its validation-aligned ranking and reintroduce less informative points.

Validation proxy: composition and size matter.

Our theoretical guarantees for $\mathtt{CLD}$ require that the validation set be a faithful proxy for the test distribution (Assumption 3 in Section 5). We therefore ask: how sensitive is $\mathtt{CLD}$ to the composition of the validation set, and does bias in this set affect downstream coreset quality? To probe this, we split CIFAR-100’s $50\text{k}$ training set into a classwise $25\text{k}/25\text{k}$ train/pool partition. From the pool, we constructed validation sets of $5{,}000$ examples each using five heuristics based on publicly available memorization scores (FZ_infl_feldman2020), denoted $\mathtt{mem}$ . Intuitively, low- $\mathtt{mem}$ points correspond to canonical, stereotypical examples (often helpful for transfer and exploited by methods such as SloCurv), while high- $\mathtt{mem}$ points surface atypical or mislabeled examples. These heuristics, therefore, let us bias the validation set toward “typical” or “atypical” regions of the data. We trained ResNet-18 proxies on the $25\text{k}$ train split, computed $\mathtt{CLD}$ with respect to each heuristic-based validation set, built class-balanced coresets, and fine-tuned ResNet-18 across coreset sizes (five seeds; see Figure 6(b)). As shown in Figure 6(a), the non-random heuristics yield validation sets with markedly different $\mathtt{mem}$ profiles. Two clear patterns emerge. First, validation sets biased toward Highest- $\mathtt{mem}$ examples consistently degrade performance across coreset sizes: overrepresenting atypical or mislabeled samples lowers accuracy and hinders learning. Second, validation sets built from Lowest- $\mathtt{mem}$ examples generally support stronger generalization, but retaining some high- $\mathtt{mem}$ points is beneficial for capturing long-tail behavior likely present in the test distribution. In our runs, Proportional sampling, which draws examples according to the pool’s original $\mathtt{mem}$ distribution, was the most reliable overall, and would likely improve further if mislabeled points were filtered or downweighted (which we did not do in this study). These findings underscore that $\mathtt{CLD}$ ’s effectiveness depends critically on the quality of the validation signal: coresets cannot exceed the fidelity of the validation dynamics they are aligned with. This motivates future work on principled procedures to build clean, representative validation sets (e.g., robust to label noise) and to select validation sizes that balance reliability with efficiency.

Overall, $\mathtt{CLD}$ already achieves implicit bias reduction via validation alignment; external stratified sampling is unnecessary and often counterproductive. We provide further results, including seed-wise stability (Section D.1) and connections to influence functions and training data attribution methods (Appendix F), in the appendix.

Loss Differences as a Gradient-Free Proxy for Influence.

The $\mathtt{CLD}$ metric is built on correlating per-example loss differences, a deliberate choice over simpler signals like raw losses or gradient norms. This choice is theoretically motivated: as we formalize in Lemma 1, a first-order expansion shows that the loss difference $\Delta\ell(\theta^{t};z)$ approximates the gradient inner product $\langle\nabla_{\theta}\ell(\theta^{t-1};z),\,\delta\theta^{t-1}\rangle$ . This insight positions $\mathtt{CLD}$ as an efficient proxy for the alignment dynamics tracked by influence methods that compute gradient similarities, such as TracIn (tracin_pruthi2020). The primary advantage of this proxy approach is computational efficiency. Methods like TracIn are powerful but expensive, requiring the storage and processing of high-dimensional per-sample gradients. $\mathtt{CLD}$ captures the same underlying alignment dynamics while avoiding this overhead entirely. Simpler signals, such as raw losses or gradient norms, are even cheaper but conceptually flawed, as they are confounded by signal drift or discard crucial directional information. Empirically, $\mathtt{CLD}$ ’s performance is comparable to TracIn on key attribution metrics, including the Linear Datamodeling Score (a measure of how well scores correlate with true sample influence) and prediction brittleness (a measure of how critical top-ranked samples are to model predictions). See Appendix F for the full analysis and a broader discussion of how $\mathtt{CLD}$ relates to influence methods.
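The first-order relationship can be checked numerically on a small model. The sketch below uses a logistic loss purely as a stand-in (the paper's experiments use deep networks) and assumes $\delta\theta^{t-1}=\theta^{t}-\theta^{t-1}$ is a single gradient step driven by a different sample. For this loss the per-sample smoothness constant is $\lVert x\rVert_{2}^{2}/4$, so the gap between the loss difference and the inner product is at most $\tfrac{1}{2}\cdot\tfrac{\lVert x\rVert_{2}^{2}}{4}\cdot\lVert\delta\theta^{t-1}\rVert_{2}^{2}$.

```python
import numpy as np

def logistic_loss(theta, x, y):
    """log(1 + exp(-y * <x, theta>)) for a label y in {-1, +1}."""
    return np.logaddexp(0.0, -y * (x @ theta))

def logistic_grad(theta, x, y):
    """Gradient of the logistic loss: -y * x * sigmoid(-y * <x, theta>)."""
    margin = y * (x @ theta)
    return -y * x / (1.0 + np.exp(margin))

rng = np.random.default_rng(0)
theta_prev = rng.normal(size=5)          # parameters at step t-1
x, y = rng.normal(size=5), 1.0           # the sample z whose loss change we track
x_b, y_b = rng.normal(size=5), -1.0      # a different sample that drives the update

eta = 0.05
delta_theta = -eta * logistic_grad(theta_prev, x_b, y_b)   # theta^t - theta^{t-1}
theta_next = theta_prev + delta_theta

loss_diff = logistic_loss(theta_next, x, y) - logistic_loss(theta_prev, x, y)
inner = logistic_grad(theta_prev, x, y) @ delta_theta
# Second-order remainder bound from L-smoothness with L = ||x||^2 / 4.
remainder_bound = 0.5 * (x @ x / 4.0) * (delta_theta @ delta_theta)
gap = abs(loss_diff - inner)
```

The gap shrinks quadratically with the step size, which is why the scalar loss difference can stand in for the gradient inner product when updates are small.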

Beyond Supervised Vision: Scope and Caveats.

Because $\mathtt{CLD}$ requires only per-example training losses and a small validation proxy, its core criterion can be adapted to other supervised settings (e.g., contrastive learning with a validation loss, or object detection by aggregating per-instance losses to a per-image score). In this work, we intentionally limit our scope to supervised image classification for controlled comparisons and do not claim empirical validation outside this domain. Applying $\mathtt{CLD}$ to modern language model fine-tuning, however, is non-trivial due to challenges like the “squeezing effect” (ren2025_learning). As fine-tuning progresses, the model concentrates probability mass onto the exact training sequences, which can paradoxically lower the assigned probabilities of similar but non-identical validation samples. This causes a divergence between training and validation loss trajectories, meaning that training samples promoting generalization are not guaranteed to have a high $\mathtt{CLD}$ score. Corroborating this finding, xia2024_LESS observe that, unlike in vision, minimizing validation loss does not reliably improve model performance in instruction tuning. Adapting a $\mathtt{CLD}$ -style selector for LLMs will therefore require developing task-appropriate generalization proxies, which is an important direction for future work.

9 Limitations

While $\mathtt{CLD}$ is scalable and effective, it does have limitations. First, it requires access to training loss trajectories across multiple checkpoints, which may not be feasible in settings where models are deployed as black boxes or when fine-tuning from pretrained checkpoints without full retraining. Second, $\mathtt{CLD}$ requires a held-out validation set, which reduces the number of samples available for training.

10 Conclusion

We introduced Correlation of Loss Differences ( $\mathtt{CLD}$ ), a simple, scalable metric for identifying data that aligns with generalization. By relying only on per-sample losses across training checkpoints, without gradients, pairwise similarities, or second-order information, $\mathtt{CLD}$ enables principled coreset selection with low compute and storage cost. Across CIFAR-100 and ImageNet-1k, $\mathtt{CLD}$ matches or outperforms state-of-the-art methods over a wide range of subset sizes and attains full-data accuracy with substantially smaller subsets. $\mathtt{CLD}$ -selected coresets also transfer across architectures (ResNet, VGG, DenseNet) with $<\!1\%$ degradation, remain stable when using only early checkpoints, and inherently reduce bias via per-class validation alignment, obviating additional stratified sampling. Our theory further shows that the convergence gap under coreset training is controlled by sample–validation alignment and the representativeness of the validation set. Taken together, these properties make $\mathtt{CLD}$ a practical tool for large-scale, budgeted training and a principled foundation for future work on robust validation design and budget-aware data selection.

Acknowledgements

This work was supported in part by the Center for the Co-Design of Cognitive Systems (CoCoSys), a DARPA-sponsored JUMP 2.0 center, the Semiconductor Research Corporation (SRC), the National Science Foundation, and Collins Aerospace. We are also thankful to Efstathia Soufleri, Akshita Gupta, Utkarsh Saxena, Amitangshu Mukherjee, and Sakshi Choudhary for their helpful discussions and feedback.

Appendix A CLD-Coreset Selection Algorithm

For completeness, we provide the full pseudocode for the coreset selection procedure described in Section 4.1. This algorithm computes per-class $\mathtt{CLD}$ scores by correlating each training sample’s loss trajectory with the corresponding class-specific validation trajectory, and selects a fixed number of top-scoring samples per class to form a class-balanced coreset.

Implementation note. Loss values in Steps 3–6 are logged as part of the normal training loop; no additional passes are introduced.
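The per-class selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than our released implementation: the function names, the list-of-lists layout for logged losses, and the use of the class-wise mean validation trajectory are assumptions made for the example.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def diffs(traj):
    """Loss differences between consecutive checkpoints."""
    return [b - a for a, b in zip(traj, traj[1:])]

def cld_coreset(train_losses, train_labels, val_losses, val_labels, per_class):
    """train_losses[i] and val_losses[j] hold one sample's loss at every checkpoint."""
    selected = []
    for c in sorted(set(train_labels)):
        val_c = [diffs(l) for l, y in zip(val_losses, val_labels) if y == c]
        # Class-wise mean validation loss-difference trajectory (assumed aggregation).
        val_mean = [sum(col) / len(col) for col in zip(*val_c)]
        scored = [(pearson(diffs(l), val_mean), i)
                  for i, (l, y) in enumerate(zip(train_losses, train_labels)) if y == c]
        scored.sort(reverse=True)                      # highest correlation first
        selected += [i for _, i in scored[:per_class]]
    return selected
```

Calling `cld_coreset(train_losses, train_labels, val_losses, val_labels, per_class=m)` returns the indices of the top-`m` training samples in each class, which yields a class-balanced coreset.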

Appendix B Detailed Theoretical Framework

In this appendix, we provide the complete theoretical framework supporting the results stated in Section 5. We first outline the detailed lemmas establishing gradient alignment and approximation properties of $\mathtt{CLD}$-selected coresets. We then conclude with a full proof of the convergence guarantee presented in Theorem 1.

B.1 Roadmap and Notation

Proof Outline. Our main convergence guarantee, Theorem 1, is built upon three supporting lemmas:

•

Lemma 1, which establishes that a high $\mathtt{CLD}$ score implies strong alignment between a sample’s gradient and the validation gradient.

•

Lemma 2, which shows that this alignment is preserved during coreset training, provided the coreset and full-data parameter trajectories remain close.

•

Lemma 3, which leverages this stable alignment to bound the approximation error between the average coreset gradient and the true population risk gradient.

Notation Summary. We use the following symbols throughout our analysis:

•

$L$ : The smoothness constant of the loss function.

•

$B$ : An upper bound on the per-sample gradient norm.

•

$k$ : The size of the coreset.

•

$Q$ : The size of the validation set.

•

$\epsilon$ : A parameter controlling the strictness of CLD-based sample selection.

•

$\kappa$ : The alignment-gap term, quantifying the gradient mismatch from coreset selection.

•

$\delta$ : The validation representativeness error, which decays as $O(1/\sqrt{Q})$ .

•

$\eta$ : The learning rate (step size) of the optimizer.

•

$T$ : The total number of training iterations.

•

$\theta_{S}^{t},\theta_{C}^{t}$ : Model parameters at iteration $t$ when training on the full dataset and the coreset, respectively.

B.2 Supporting Lemmas for Theorem 1

Lemma 1 (High $\mathtt{CLD}$ Implies Gradient Alignment).

Consider a training sample $\vec{z}_{m}\in{\mathbf{S}}$ with $\mathtt{CLD}(\vec{z}_{m})=\rho\!\bigl(\vec{\Delta}(\vec{z}_{m}),\vec{\Delta}^{\prime}_{{\mathbf{V}}}\bigr)\geq 1-\epsilon$ for some small $\epsilon>0$ . Let $\theta_{{\mathbf{S}}}^{t}$ be the parameters obtained by running algorithm $\mathcal{A}$ on ${\mathbf{S}}$ for $t$ iterations. Let $\delta\theta^{t-1}\coloneqq\theta_{{\mathbf{S}}}^{t}-\theta_{{\mathbf{S}}}^{t-1}$ be the parameter update at step $t$ .

Suppose the learning algorithm is run for a sufficiently large number of iterations $T$ . Assume the sequence of parameter updates $\{\delta\theta^{t-1}\}_{t=1}^{T}$ is sufficiently varied. This means the updates are not persistently orthogonal to any fixed non-zero vector direction in the relevant parameter subspace.

Then, for most training steps $t$ where $G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t})\neq 0$ , the sample gradient $\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t},\vec{z}_{m})$ and the validation gradient $G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t})$ are well-aligned:

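$$\cos\!\left(\angle\bigl(\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t},\vec{z}_{m}),\,G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t})\bigr)\right)\geq 1-\epsilon^{\prime}_{t},$$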

where $\epsilon^{\prime}_{t}\to 0$ as $\epsilon\to 0$ .

Proof Outline.

Use first-order loss changes to relate per-example loss differences and validation loss differences at consecutive steps. High correlation of differences implies an (approximately) positive linear link between their inner products with update directions, forcing the sample gradient to be a positive scalar multiple of the validation gradient under sufficiently varied updates. Continuity then yields $\cos\!\left(\angle(\nabla\ell(\cdot,\vec{z}_{m}),G_{{\mathbf{V}}})\right)\geq 1-\epsilon^{\prime}_{t}$ with $\epsilon^{\prime}_{t}\!\to\!0$ as $\epsilon\!\to\!0$ . ∎

Proof.

We first analyze the idealized case where the correlation is perfect ( $\epsilon=0$ ) and the underlying approximations hold exactly, and then argue by continuity.

Assume the first-order Taylor expansions are exact for the loss changes:

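$$x_{t}\coloneqq\ell(\theta_{{\mathbf{S}}}^{t},\vec{z}_{m})-\ell(\theta_{{\mathbf{S}}}^{t-1},\vec{z}_{m})=\langle\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t-1},\vec{z}_{m}),\,\delta\theta^{t-1}\rangle,\qquad y_{t}\coloneqq\hat{R}_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t})-\hat{R}_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1})=\langle G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1}),\,\delta\theta^{t-1}\rangle.$$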

Assume perfect correlation $\rho(\vec{x},\vec{y})=1$ .

This implies an exact positive linear relationship $x_{t}=c\,y_{t}+K^{\prime}$ for all $t$ , where $c=\sigma_{x}/\sigma_{y}>0$ and $K^{\prime}=\overline{x}-c\,\overline{y}$ .

Substituting the definitions of $x_{t}$ and $y_{t}$ :

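$$\langle\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t-1},\vec{z}_{m}),\,\delta\theta^{t-1}\rangle=c\,\langle G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1}),\,\delta\theta^{t-1}\rangle+K^{\prime}.$$

Rearranging yields:

$$\langle\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t-1},\vec{z}_{m})-c\,G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1}),\,\delta\theta^{t-1}\rangle=K^{\prime}.$$

Since the mean of the loss trajectory is small compared to the variance of its terms (losses eventually reduce to $0$), it is reasonable to assume $K^{\prime}\approx 0$.

Thus, for $t=1,\dots,T$:

$$\langle\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t-1},\vec{z}_{m})-c\,G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1}),\,\delta\theta^{t-1}\rangle=0.$$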

Let $\vec{w}_{t-1}\coloneqq\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t-1},\vec{z}_{m})-c\,G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1})$ .

The vector $\vec{w}_{t-1}$ is exactly orthogonal to the update direction $\delta\theta^{t-1}$ at each step $t$ .

Now, invoke the assumption that the sequence of updates $\{\delta\theta^{t-1}\}_{t=1}^{T}$ is sufficiently varied.

This means the updates are not persistently orthogonal to any fixed non-zero direction $\vec{w}_{t-1}$ .

If $\vec{w}_{t-1}$ were non-zero, the variation in updates would eventually yield a $\delta\theta^{t-1}$ such that $\langle\vec{w}_{t-1},\delta\theta^{t-1}\rangle\neq 0$ .

Since the inner product is exactly zero for all $t$ in our idealized case, the only possibility consistent with the sufficient variation assumption is that $\vec{w}_{t-1}$ must be the zero vector. Thus:

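$$\vec{w}_{t-1}=\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t-1},\vec{z}_{m})-c\,G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1})=\vec{0}.$$

This signifies that the sample gradient is exactly a positive scalar multiple ($c>0$) of the validation gradient:

$$\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t-1},\vec{z}_{m})=c\,G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1}).$$

Consequently, the vectors are perfectly collinear and point in the same direction (assuming $G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t-1})\neq 0$). The angle $\gamma_{t-1}$ between them is exactly $0$. Therefore, in this idealized case:

$$\cos(\gamma_{t-1})=1.$$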

This derivation holds under the ideal conditions ( $\epsilon=0$ , exact Taylor approx., $K^{\prime}=0$ ).

Since the involved operations are continuous, when the conditions are only approximately met (i.e., $\rho\geq 1-\epsilon$ with $\epsilon\to 0$ , Taylor approx. is good, $K^{\prime}$ is small), the resulting cosine similarity will be close to $1$ .

We express this conclusion as

$$\cos(\gamma_{t-1})\geq 1-\epsilon^{\prime}_{t-1},$$

where the error $\epsilon^{\prime}_{t-1}\to 0$ as $\epsilon\to 0$ .

Re-indexing $t-1$ as $t$ and assuming this alignment holds for most steps, the lemma statement follows. ∎

Remark 2 (On Update Sequence Variation).

The assumption regarding the update sequence $\{\delta\theta^{t-1}\}$ is that it exhibits enough variation over the trajectory to ensure that no fixed non-zero vector can remain orthogonal to all updates. This property is weaker than requiring the updates to span the entire parameter space, but it is sufficient for the argument. It essentially prevents the gradient difference vector from hiding in a direction that the optimization process never explores. Stochastic optimization methods accumulating updates over many iterations (large $T$ ) are often expected to satisfy this sufficient variation condition.

Lemma 2 (Stability of Gradient Alignment).

Suppose the conditions in Theorem 1 hold: specifically, $L$-smoothness and bounded gradients ($\left\lVert\nabla_{\theta}\ell(\theta,\vec{z})\right\rVert_{2}\leq B$ for all $\vec{z}$). Consider a coreset ${\mathbf{C}}$ constructed by selecting samples with high $\mathtt{CLD}$ scores:

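$${\mathbf{C}}=\bigl\{\vec{z}_{m}\in{\mathbf{S}}\;:\;\mathtt{CLD}(\vec{z}_{m})\geq 1-\epsilon\bigr\},\qquad|{\mathbf{C}}|=k.$$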

Let $\delta_{t}\coloneqq\theta_{{\mathbf{C}}}^{t}-\theta_{{\mathbf{S}}}^{t}$ denote the gap between the coreset and full-data parameter trajectories at step $t$, with norm $\left\lVert\delta_{t}\right\rVert_{2}$.

Then, for each sample $\vec{z}_{m}\in{\mathbf{C}}$ , the cosine similarity between its gradient and the average validation gradient at step $t$ is lower bounded by

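$$\cos\!\left(\angle\bigl(\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m}),\,G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\bigr)\right)\geq 1-\kappa,$$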

where $\kappa=\epsilon^{\prime}_{t}+\frac{4L}{B}\left\lVert\delta_{t}\right\rVert_{2}+\frac{3L^{2}}{B^{2}}\left\lVert\delta_{t}\right\rVert_{2}^{2},$ and $\epsilon^{\prime}_{t}\to 0$ as $\epsilon\to 0$ .

Proof outline.

Compare gradients at $\theta_{{\mathbf{C}}}^{t}$ and $\theta_{{\mathbf{S}}}^{t}$ using $L$ -smoothness to bound deviations by $O(\|\delta_{t}\|)$ . Expand the inner product and bound cross-terms via Cauchy–Schwarz and the gradient-norm bound $B$ . Plug the base alignment from Lemma 1 at $\theta_{{\mathbf{S}}}^{t}$ to transfer alignment to $\theta_{{\mathbf{C}}}^{t}$ , yielding the stated $1-\kappa$ lower bound with linear/quadratic dependence on $\|\delta_{t}\|$ . ∎

Proof.

Let $\omega_{m}=\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})-\nabla_{\theta}\ell(\theta_{{\mathbf{S}}}^{t},\vec{z}_{m})$ and $\omega_{\mathbf{V}}=G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})-G_{{\mathbf{V}}}(\theta_{{\mathbf{S}}}^{t})$ denote the deviations between gradients evaluated on the coreset trajectory and the full dataset trajectory.

By $L$ -smoothness of $\ell(\cdot,\vec{z}_{m})$ and of the validation-average loss $\hat{R}_{{\mathbf{V}}}(\theta)\coloneqq\frac{1}{Q}\sum_{j=1}^{Q}\ell(\theta,\vec{q}_{j})$ , we have:

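$$\left\lVert\omega_{m}\right\rVert_{2}\leq L\left\lVert\delta_{t}\right\rVert_{2},\qquad\left\lVert\omega_{\mathbf{V}}\right\rVert_{2}\leq L\left\lVert\delta_{t}\right\rVert_{2}.$$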

Expanding the inner product:

[TABLE]

Applying Cauchy–Schwarz inequality and the bounded gradient norm $\left\lVert\nabla_{\theta}\ell(\theta,\vec{z})\right\rVert_{2}\leq B$ , we have:

[TABLE]

Thus,

[TABLE]

The denominator is upper bounded by:

[TABLE]

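Combining this result with Lemma 1, we can conclude:

$$\cos\!\left(\angle\bigl(\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m}),\,G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\bigr)\right)\geq 1-\epsilon^{\prime}_{t}-\frac{4L}{B}\left\lVert\delta_{t}\right\rVert_{2}-\frac{3L^{2}}{B^{2}}\left\lVert\delta_{t}\right\rVert_{2}^{2}=1-\kappa,$$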

where $\epsilon^{\prime}_{t}$ captures the initial alignment error when training on ${\mathbf{S}}$ . This completes the proof. ∎
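To give a feel for how the alignment gap $\kappa$ in Lemma 2 grows with the trajectory deviation, the short sketch below evaluates its closed form. The function name `kappa` and the numerical values of $\epsilon^{\prime}_{t}$, $L$, and $B$ are hypothetical and chosen only for illustration.

```python
def kappa(eps_prime, L, B, delta_norm):
    """Alignment gap from Lemma 2: eps' + (4L/B)*||delta|| + (3L^2/B^2)*||delta||^2."""
    r = L * delta_norm / B
    return eps_prime + 4.0 * r + 3.0 * r * r

# Illustrative values only: the gap grows with the trajectory deviation.
for d in (0.0, 0.01, 0.05, 0.1):
    print(d, kappa(eps_prime=0.01, L=1.0, B=1.0, delta_norm=d))
```

With zero deviation the gap reduces to $\epsilon^{\prime}_{t}$; for small deviations the linear term dominates.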

Remark 3.

Lemma 2 shows that if a sample’s gradient is well-aligned with the validation gradient during training on the full dataset (i.e., $\epsilon^{\prime}_{t}$ is small), then this alignment is preserved when training on a coreset ${\mathbf{C}}$, as long as the parameter trajectories $\theta_{{\mathbf{S}}}^{t}$ and $\theta_{{\mathbf{C}}}^{t}$ remain close. The degradation in alignment is bounded by terms that are linear and quadratic in $\left\lVert\delta_{t}\right\rVert_{2}$. Thus, as long as the coreset trajectory stays near the full dataset trajectory, the generalization-relevant properties captured by $\mathtt{CLD}$ remain stable. This stability is crucial for ensuring that $\mathtt{CLD}$-based coresets maintain the training dynamics of the full dataset.

Remark 4 (Influence of Coreset Size on $\kappa$ ).

It is important to explicitly consider how the coreset size $k=|{\mathbf{C}}|$ (as specified in Lemma 2) influences the components of $\kappa=\epsilon^{\prime}_{t}+\frac{4L}{B}\left\lVert\delta_{t}\right\rVert_{2}+\frac{3L^{2}}{B^{2}}\left\lVert\delta_{t}\right\rVert_{2}^{2}$.

•

The term $\epsilon^{\prime}_{t}$, representing the initial alignment error derived from Lemma 1, is affected by $k$. A smaller coreset size $k$ allows for a more stringent selection criterion for samples based on their $\mathtt{CLD}$ scores. Specifically, one can choose only samples with $\mathtt{CLD}(\vec{z}_{m})$ very close to $1$, which corresponds to a smaller $\epsilon$ in the selection rule $\mathtt{CLD}(\vec{z}_{m})\geq 1-\epsilon$ (from Theorem 1 and Lemma 2). A smaller $\epsilon$ naturally leads to a smaller $\epsilon^{\prime}_{t}$.

•

Conversely, the terms in $\kappa$ that depend on $\left\lVert\delta_{t}\right\rVert_{2}=\left\lVert\theta_{{\mathbf{C}}}^{t}-\theta_{{\mathbf{S}}}^{t}\right\rVert_{2}$ (the deviation between coreset and full-data parameter trajectories) are also influenced by $k$ . While a smaller $k$ allows for higher individual sample quality, a very small $k$ might result in a coreset that is less representative of the full dataset ${\mathbf{S}}$ . This reduced representativeness can lead to a larger divergence $\left\lVert\delta_{t}\right\rVert_{2}$ during training, as the optimization trajectory on the small coreset may differ more substantially from that on the full data. An increase in $\left\lVert\delta_{t}\right\rVert_{2}$ would, in turn, increase the overall value of $\kappa$ .

Therefore, the selection of an appropriate coreset size $k$ involves an inherent trade-off. A smaller $k$ can be beneficial for the $\epsilon^{\prime}_{t}$ component of $\kappa$ by enabling the selection of higher-quality samples. However, if $k$ is too small, it could adversely affect the components of $\kappa$ related to $\left\lVert\delta_{t}\right\rVert_{2}$ by making the coreset insufficiently representative. The stability discussed in Remark 3 relies on $\left\lVert\delta_{t}\right\rVert_{2}$ remaining small, highlighting the importance of $k$ being chosen to adequately approximate the full dataset’s training dynamics while leveraging the benefits of high $\mathtt{CLD}$ scores.

Lemma 3 (Subset-Gradient Approximation).

Suppose the conditions in Theorem 1 hold, including $L$-smoothness, bounded gradients, and validation representativeness as described in Section 5.

Define the average coreset gradient at step $t$ as

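$$\gamma_{{\mathbf{C}}}^{t}\coloneqq\frac{1}{k}\sum_{\vec{z}_{m}\in{\mathbf{C}}}\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m}).$$

Then, for every training step $t$, we have

$$\left\lVert\gamma_{{\mathbf{C}}}^{t}-\nabla_{\theta}R_{{\mathbf{D}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}\leq B\sqrt{2\kappa}+\delta,$$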

where $\kappa$ captures the alignment error and satisfies $\kappa\to 0$ as $\epsilon\to 0$ .

Proof outline.

Split the error as $\|\gamma_{{\mathbf{C}}}^{t}-G_{{\mathbf{V}}}\|+\|G_{{\mathbf{V}}}-\nabla R_{{\mathbf{D}}}\|$ . The second term is $\delta$ by validation representativeness. For the first, average the per-sample deviations and apply Jensen to get a mean of squared distances; combine with Lemma 2 to bound each by $2B^{2}\kappa$ , yielding $\|\gamma_{{\mathbf{C}}}^{t}-G_{{\mathbf{V}}}\|\leq B\sqrt{2\kappa}$ . ∎

Proof.

We decompose the error using the triangle inequality:

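$$\left\lVert\gamma_{{\mathbf{C}}}^{t}-\nabla_{\theta}R_{{\mathbf{D}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}\leq\left\lVert\gamma_{{\mathbf{C}}}^{t}-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}+\left\lVert G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})-\nabla_{\theta}R_{{\mathbf{D}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}.$$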

The second term $\left\lVert G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})-\nabla_{\theta}R_{{\mathbf{D}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}$ is bounded by $\delta$ by the validation representativeness assumption.

To bound the first term $\left\lVert\gamma_{{\mathbf{C}}}^{t}-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}$ , we apply Jensen’s inequality:

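$$\left\lVert\gamma_{{\mathbf{C}}}^{t}-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}^{2}=\left\lVert\frac{1}{k}\sum_{\vec{z}_{m}\in{\mathbf{C}}}\bigl(\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\bigr)\right\rVert_{2}^{2}\leq\frac{1}{k}\sum_{\vec{z}_{m}\in{\mathbf{C}}}\left\lVert\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}^{2}.$$

Define $\varphi_{m}^{t}$ as the angle between $\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})$ and $G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})$. By Lemma 2, we have $\cos\varphi_{m}^{t}\geq 1-\kappa$ for all $m$.

Expanding the squared distance:

$$\left\lVert\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}^{2}=\left\lVert\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})\right\rVert_{2}^{2}+\left\lVert G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}^{2}-2\left\lVert\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})\right\rVert_{2}\left\lVert G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}\cos\varphi_{m}^{t}.$$

Since $\left\lVert\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})\right\rVert_{2},\left\lVert G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}\leq B$ and $\cos\varphi_{m}^{t}\geq 1-\kappa$, we have

$$\left\lVert\nabla_{\theta}\ell(\theta_{{\mathbf{C}}}^{t},\vec{z}_{m})-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}^{2}\leq 2B^{2}\kappa.$$

Substituting back into the Jensen bound,

$$\left\lVert\gamma_{{\mathbf{C}}}^{t}-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}^{2}\leq\frac{1}{k}\sum_{\vec{z}_{m}\in{\mathbf{C}}}2B^{2}\kappa=2B^{2}\kappa.$$

Taking square roots gives

$$\left\lVert\gamma_{{\mathbf{C}}}^{t}-G_{{\mathbf{V}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}\leq B\sqrt{2\kappa}.$$

Thus, combining the two bounds,

$$\left\lVert\gamma_{{\mathbf{C}}}^{t}-\nabla_{\theta}R_{{\mathbf{D}}}(\theta_{{\mathbf{C}}}^{t})\right\rVert_{2}\leq B\sqrt{2\kappa}+\delta,$$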

as claimed. ∎
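The Jensen step in the proof can be checked numerically in a simple setting. The sketch below is an illustration under stated assumptions, not part of the proof: it works in two dimensions, takes every per-sample gradient and the validation gradient to have norm exactly $B$ (the case in which the per-sample bound $2B^{2}\kappa$ holds with equality at the largest allowed angle), and draws angles uniformly within the cone $\cos\varphi\geq 1-\kappa$. All numerical values are hypothetical.

```python
import math
import random

random.seed(0)
B, kappa_val, k = 2.0, 0.05, 100
phi_max = math.acos(1.0 - kappa_val)   # largest angle allowed by cos(phi) >= 1 - kappa
G = (B, 0.0)                           # validation gradient, assumed to have norm B

# Per-sample gradients of norm B whose angle to G stays within the allowed cone.
grads = []
for _ in range(k):
    phi = random.uniform(-phi_max, phi_max)
    grads.append((B * math.cos(phi), B * math.sin(phi)))

# Average coreset gradient and its distance to the validation gradient.
mean = (sum(g[0] for g in grads) / k, sum(g[1] for g in grads) / k)
gap = math.hypot(mean[0] - G[0], mean[1] - G[1])
bound = B * math.sqrt(2.0 * kappa_val)
print(gap, bound)
```

In this setting each per-sample squared distance equals $2B^{2}(1-\cos\varphi)\leq 2B^{2}\kappa$, so the averaged gap cannot exceed $B\sqrt{2\kappa}$.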

Remark 5.

Lemma 3 shows that under mild conditions, the average gradient computed over a coreset selected based on $\mathtt{CLD}$ remains close to the true risk gradient throughout training. The deviation is controlled by two sources: the alignment error $\kappa$ arising from the selection of high-$\mathtt{CLD}$ samples, and the validation approximation error $\delta$ due to finite sample size. Consequently, optimization over $\mathtt{CLD}$-coresets closely tracks the gradient flow of the full dataset, ensuring that convergence and generalization properties are preserved. This result is crucial for connecting loss trajectory dynamics with practical coreset construction.

B.3 Proof for Theorem 1

Proof outline.

Apply the $L$ -smooth descent lemma to $R_{\mathbf{D}}(\theta)$ with the update $\theta_{{\mathbf{C}}}^{t+1}=\theta_{{\mathbf{C}}}^{t}-\eta\,\gamma_{{\mathbf{C}}}^{t}$ to obtain a one-step inequality involving $\langle\nabla R_{{\mathbf{D}}},\gamma_{{\mathbf{C}}}^{t}\rangle$ . Decompose $\gamma_{{\mathbf{C}}}^{t}=\nabla R_{{\mathbf{D}}}+(\gamma_{{\mathbf{C}}}^{t}-\nabla R_{{\mathbf{D}}})$ and bound the error term using Lemma 3: $E_{t}\!=\!\|\gamma_{{\mathbf{C}}}^{t}-\nabla R_{{\mathbf{D}}}\|\!\leq\!B\sqrt{2\kappa}+\delta$ . Control the mixed term by Cauchy–Schwarz + Young; bound $\|\gamma_{{\mathbf{C}}}^{t}\|$ by $B$ . Sum over $t$ to telescope $R_{t}-R_{t+1}$ and isolate $\min_{t}\|\nabla R_{{\mathbf{D}}}(\theta_{{\mathbf{C}}}^{t})\|^{2}$ , yielding the stated bound with $(B\sqrt{2\kappa}+\delta)^{2}$ and an $L\eta B^{2}$ residual. ∎

Proof.

Define

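$$R_{t}\coloneqq R_{{\mathbf{D}}}(\theta_{{\mathbf{C}}}^{t}),\qquad G_{t}\coloneqq\nabla_{\theta}R_{{\mathbf{D}}}(\theta_{{\mathbf{C}}}^{t}).$$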

By $L$-smoothness of $\ell(\cdot)$, and by extension of $R_{\mathbf{D}}(\cdot)$, and with $\eta\leq 1/L$,

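$$R_{t+1}\leq R_{t}+\langle G_{t},\,\theta_{{\mathbf{C}}}^{t+1}-\theta_{{\mathbf{C}}}^{t}\rangle+\frac{L}{2}\left\lVert\theta_{{\mathbf{C}}}^{t+1}-\theta_{{\mathbf{C}}}^{t}\right\rVert_{2}^{2}.$$

Substituting the model update $\theta_{{\mathbf{C}}}^{t+1}-\theta_{{\mathbf{C}}}^{t}=-\eta\gamma_{{\mathbf{C}}}^{t}$:

$$R_{t+1}\leq R_{t}-\eta\,\langle G_{t},\,\gamma_{{\mathbf{C}}}^{t}\rangle+\frac{L\eta^{2}}{2}\left\lVert\gamma_{{\mathbf{C}}}^{t}\right\rVert_{2}^{2}.$$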

Rearranging,

[TABLE]

Decomposing the inner product using the true gradient $G_{t}$ we get,

[TABLE]

Substituting this back,

[TABLE]

Let $E_{t}=\left\lVert\gamma_{{\mathbf{C}}}^{t}-G_{t}\right\rVert_{2}$ .

Lemma 3 under the stated assumptions gives the bound $E_{t}\leq B\sqrt{2\kappa}+\delta$ .

By Cauchy–Schwarz inequality,

[TABLE]

Young’s inequality states that $ab\leq a^{2}/(2\gamma)+\gamma b^{2}/2,\quad\forall a,b\geq 0\text{ and }\gamma>0$ .

By using this inequality with $\gamma=1$ ,

[TABLE]

Substituting this into the inequality in Equation 55,

[TABLE]

Since all gradients are bounded by $B$ , their average $\left\lVert\gamma_{{\mathbf{C}}}^{t}\right\rVert_{2}$ is also bounded by $B$ .

[TABLE]

Summing from $t=0$ to $T-1$ :

[TABLE]

Let $R_{\inf}=\inf_{\theta}R_{\mathbf{D}}(\theta)$ . Then $R_{0}-R_{T}\leq R_{0}-R_{\inf}$ . Substituting the bound $E_{t}\leq B\sqrt{2\kappa}+\delta$ :

[TABLE]

The sum on the left is lower bounded by $T$ times the minimum term, i.e., $\sum_{t=0}^{T-1}\left\lVert G_{t}\right\rVert_{2}^{2}\geq T\cdot\min_{0\leq t<T}\left\lVert G_{t}\right\rVert_{2}^{2}$ .

[TABLE]

Since $\eta,T>0$ ,

[TABLE]

This proves the theorem. ∎

Appendix C Datasets, Models, and Experimental Details

We evaluate our method on two standard image classification benchmarks: CIFAR-100 (cifar) and ImageNet-1k (imagenet). CIFAR-100 consists of 50,000 training and 10,000 test images across 100 classes and is publicly available without licensing restrictions. For ImageNet-1k, we use the official release from https://image-net.org/download.php, which is provided under a standard academic research license and requires user agreement to the terms of access.

Architectures.

Our experiments employ the following CNN backbones:

•

ResNet-18, ResNet-34, and ResNet-50 (resnet) from https://pytorch.org/vision/stable/models/resnet.html (BSD-3-Clause license).

•

VGG-19 with batch normalization (vgg) from https://pytorch.org/vision/stable/models/vgg.html (BSD-3-Clause license).

•

DenseNet-121 (densenet) from https://pytorch.org/vision/stable/models/densenet.html (BSD-3-Clause license).

For baseline comparisons, we use the DeepCore library (deepcore; https://github.com/PatrickZH/DeepCore), which provides standardized implementations of several coreset selection techniques and is licensed under MIT. All training and evaluation code was implemented in PyTorch; dependencies including torchvision are MIT/BSD licensed.

Training Setup for CIFAR-100.

All networks were trained using SGD (sgd) for $164$ epochs with an initial learning rate of $0.1$, decayed by a factor of $0.1$ at epochs $81$ and $121$. Nesterov momentum (nesterov) with momentum $0.9$ was used, along with weight decay $5\times 10^{-4}$. Standard augmentations included resizing to $32\times 32$, random cropping with padding $=4$, random horizontal flips, and normalization.

Training Setup for ImageNet-1k.

Training followed standard ImageNet protocols: all models were trained with SGD for 90 epochs, with a learning rate of $0.1$ decayed by $0.1$ at epochs 30 and 60. Nesterov momentum with coefficient $0.9$ and weight decay $10^{-4}$ were used. Data augmentations included random resized cropping to $224\times 224$ , horizontal flipping, and normalization.
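The step-decay schedules used in both training setups above can be written as one small function. This is a sketch: the helper name `step_lr` is hypothetical, and counting epochs from zero (so that the decay takes effect at the start of the listed epoch) is an assumption about the indexing convention.

```python
def step_lr(epoch, base_lr=0.1, milestones=(81, 121), gamma=0.1):
    """Learning rate at a given epoch (0-indexed, assumed) under step decay."""
    drops = sum(1 for m in milestones if epoch >= m)   # milestones already passed
    return base_lr * gamma ** drops

# CIFAR-100 recipe: 164 epochs, decay at epochs 81 and 121.
cifar_schedule = [step_lr(e) for e in range(164)]
# ImageNet-1k recipe: 90 epochs, decay at epochs 30 and 60.
imagenet_schedule = [step_lr(e, milestones=(30, 60)) for e in range(90)]
```

Each schedule takes three distinct values: $0.1$ before the first milestone, then approximately $0.01$ and $0.001$ after each decay.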

Reproducibility.

No fine-tuning or additional regularization was applied to any method, including ours, ensuring fairness in coreset comparisons. All methods used the same validation split as the validation proxy, and the same train split as the full training set available for each seed when scoring training samples. Each experiment was repeated across 5 independent runs with distinct seeds; reported results reflect the mean and standard deviation.

All experiments were conducted on a private compute cluster with access to NVIDIA A40 GPUs (48 GB memory, 300W TDP). All training and evaluation runs were performed in full precision using PyTorch.

Results.

Performance results for CIFAR-100 and ImageNet-1k coreset experiments are shown in Tables 2, 3, 4, and 5. Cross-architecture transferability results for $\mathtt{CLD}$ coresets, as discussed in Section 6, are shown in Table 6.

Appendix D Additional Ablations

D.1 Stability across random seeds

We measure sensitivity to random initialization by computing per-example $\mathtt{CLD}$ scores across five independent seeds on ImageNet-1k (ResNet-18). The pairwise mean absolute error (MAE) between score vectors is consistently below $10^{-5}$, indicating negligible variance and high reproducibility; see Figure 7(a).
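The pairwise MAE statistic can be computed as in the sketch below, assuming each seed yields one score vector with samples in the same order. The function name `pairwise_mae` and the toy inputs in the usage note are hypothetical.

```python
from itertools import combinations

def pairwise_mae(score_vectors):
    """Mean absolute error between every pair of equal-length score vectors."""
    out = {}
    for (i, a), (j, b) in combinations(enumerate(score_vectors), 2):
        out[(i, j)] = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return out
```

For example, `pairwise_mae([[0.0, 1.0], [0.0, 0.5], [1.0, 1.0]])` returns one entry per seed pair, keyed by the pair of seed indices.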

D.2 Minimum subset size for full-data accuracy

We quantify the subset fraction required to recover near full-data performance on ImageNet-1k (ResNet-18). As shown in Figure 7(b), $\mathtt{CLD}$ attains test accuracy within $0.5\%$ of the full-data model using only $75\%$ of the training set, on par with $\mathbb{D}^{2}$-$\mathtt{Pruning}$ and $\mathtt{DUAL}$, and superior to other baselines we evaluated.

Appendix E Detailed Explanation of Compute and Storage Cost of Coreset Methodologies

Recap of Notation

We denote the number of training samples by $N$ ( ${\mathbf{S}}\sim{\mathbf{D}}^{N}$ ) and the number of query (held-out validation) samples by $Q$ ( ${\mathbf{V}}\sim{\mathbf{D}}^{Q}$ ). The model is trained for $T$ epochs. Certain TDA metrics have hyperparameters (denoted $\lambda_{\tau}$ ) used to compute the TDA metric $\tau$ , which may influence computational cost.

We measure computation in floating-point operations (FLOPs). Let $f_{\text{large}}$ be the cost of a single-example forward pass for the large model with $p_{\text{large}}$ parameters, and approximate the backward-pass cost as $2f_{\text{large}}$ . When a proxy (smaller) model is used, we write its per-example forward cost and parameter count as $f$ and $p$ , with $f\ll f_{\text{large}}$ and $p\ll p_{\text{large}}$ .

$R$ is the number of model retrainings (when applicable). $k$ is the coreset size; $d$ is the feature dimensionality; and $B$ is the minibatch size (used during training but not appearing in per-example FLOP counts). Some methods perform subset reselection during training; when used, we denote the reselection interval (in epochs) by $\gamma$ .

Reference: full-data training (large model)

Training the large model on all $N$ points for $T$ epochs costs $3NT\,f_{\text{large}}$ FLOPs and stores $p_{\text{large}}$ parameters (ignoring optimizer state). Totals below are the end-to-end cost to (i) select a coreset of size $k$ and (ii) train the large model on that coreset.

Example scenario (used for all plug-in estimates).

To further illustrate the computational efficiency of $\mathtt{CLD}$ , we provide approximate cost values by substituting the values of the parameters for finding and training a $10\%$ coreset on the ImageNet-1k dataset.

•

$N{=}1{,}268{,}355$ (99% of train), $Q{=}12{,}812$ (remaining 1%), $d{=}224{\times}224{\times}3$ , $c{=}1000$ .

•

Size of coreset $k{=}126{,}836$ .

•

Proxy encoder: ResNet-18 with $p{=}11{,}689{,}128$ and per-example forward FLOPs $f{=}1{,}818{,}228{,}160$ .

•

Large model: ResNet-50 with $p_{\text{large}}{=}25{,}557{,}032$ and $f_{\text{large}}{\approx}8{,}178{,}000{,}000$ .

•

We use the standard ImageNet recipe of $T{=}T_{\text{proxy}}{=}90$ epochs (bearpaw_github).

•

When a method has additional parameters, we use the paper’s choice (e.g., in Glister, $\gamma{=}20$ ).
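As a quick plug-in check of the reference cost $3NT\,f_{\text{large}}$ with the scenario values listed above, the following sketch computes the full-data training cost and the cost of training on the $10\%$ coreset alone (selection cost excluded). The helper name `train_flops` is hypothetical.

```python
# Scenario values from the list above.
N = 1_268_355              # training samples
k = 126_836                # coreset size (10%)
T = 90                     # epochs
f_large = 8_178_000_000    # per-example forward FLOPs, ResNet-50 (approximate)

def train_flops(num_samples, epochs, forward_flops):
    """Forward plus backward cost, with backward approximated as 2x forward."""
    return 3 * num_samples * epochs * forward_flops

full = train_flops(N, T, f_large)
coreset = train_flops(k, T, f_large)
print(f"full-data training: {full:.3e} FLOPs")
print(f"coreset training only: {coreset:.3e} FLOPs")
```

Full-data training comes to roughly $2.8\times 10^{18}$ FLOPs, and coreset training alone costs about one tenth of that; the method-specific selection overheads below are added on top.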

E.1 Score-based Methods

E.1.1 Kernel Herding (Herding)

Herding (herding_chen2010) iteratively constructs a representative subset by approximating the data distribution in an RKHS. At iteration $t$ , it selects

[TABLE]

where $\phi(\cdot)$ is the kernel feature map and $w_{t-1}$ is an RKHS weight vector. Repeating for $k$ iterations yields a size- $k$ coreset.

Execution. One-time selection prior to training the large model (a proxy encoder is used to obtain features):

i) Train proxy encoder: train a proxy model for $T_{\text{proxy}}$ epochs (forward cost $f$, parameters $p$).

ii) Encode full dataset: extract features for all $N$ points using the trained proxy ($Nf$ FLOPs).

iii) Herding selection: for $t{=}1{:}k$, update candidate scores and select $\vec{z}^{*t}$ using inner products in feature space (explicit features of dim. $d$ give $\mathcal{O}(Nd)$ per iteration, i.e., $\mathcal{O}(Ndk)$ total).

iv) Train on coreset: train the large model on the selected $k$ points for $T_{\text{late}}$ epochs (here $T_{\text{late}}{=}T$).
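The selection step (iii) can be sketched as follows. This is an illustration under assumptions rather than the DeepCore implementation: it uses the standard herding recursion $w_{t}=w_{t-1}+\mu-\phi(\vec{z}^{*t})$ with $w_{0}=\mu$ (the mean feature vector), treats the explicit proxy features as $\phi$, and selects without replacement. The function name `kernel_herding` is hypothetical.

```python
def kernel_herding(features, k):
    """Greedy herding over explicit feature vectors; returns k selected indices."""
    n, d = len(features), len(features[0])
    mu = [sum(f[j] for f in features) / n for j in range(d)]   # mean embedding
    w = mu[:]                                                   # w_0 = mu (assumed)
    chosen = []
    for _ in range(k):
        # Pick the unselected point with the largest inner product <w, phi(z)>.
        best = max((i for i in range(n) if i not in chosen),
                   key=lambda i: sum(wj * fj for wj, fj in zip(w, features[i])))
        chosen.append(best)
        # Herding update: w_t = w_{t-1} + mu - phi(z_t).
        w = [wj + mj - fj for wj, mj, fj in zip(w, mu, features[best])]
    return chosen
```

Each iteration scans all $N$ candidates with a $d$-dimensional inner product, which matches the $\mathcal{O}(Nd)$ per-iteration cost stated in step (iii).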

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. Storing explicit features dominates; caching a $k{\times}k$ Gram among selected points is optional:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{late}}{=}90$ , $T_{\text{proxy}}{=}90$ ).

[TABLE]

E.1.2 Example Forgetting (Forgetting)

Forgetting (forgetting_toneva2018) measures, for each training sample, how many times it transitions from being correctly classified to incorrectly classified during training (“forgetting events”). Examples with higher forgetting counts are ranked as more informative.

Execution. One-time selection prior to training the large model on the coreset. Let $T_{\text{early}}$ be the number of early epochs run on the full dataset to collect forgetting statistics, and $T_{\text{late}}\!=\!T-T_{\text{early}}$ the remaining epochs used to train on the selected coreset:

i) Train on all $N$ points for $T_{\text{early}}$ epochs while tracking, per example, the previous correctness bit and a forgetting counter (constant-time update per visit).

ii) Select the top $k$ examples by forgetting count; train the large model on this coreset for $T_{\text{late}}$ epochs.
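The bookkeeping in step (i) amounts to one bit and one counter per sample. A minimal sketch, assuming per-epoch correctness is available as a list of booleans (the names `forgetting_counts` and `top_k` are hypothetical):

```python
def forgetting_counts(correct_history):
    """correct_history[e][i] is True if sample i was classified correctly at epoch e."""
    n = len(correct_history[0])
    prev = [False] * n          # previous correctness bit per sample
    counts = [0] * n            # forgetting counter per sample
    for epoch in correct_history:
        for i, ok in enumerate(epoch):
            if prev[i] and not ok:      # correct -> incorrect transition
                counts[i] += 1
            prev[i] = ok
    return counts

def top_k(counts, k):
    """Indices of the k samples with the highest forgetting counts."""
    return sorted(range(len(counts)), key=lambda i: -counts[i])[:k]
```

In a real training loop the same update would be applied in streaming fashion as each sample is visited, so no per-epoch history needs to be stored.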

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. Streaming the metric requires only one scalar counter (and one correctness bit) per training example:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{early}}{=}10$ , $T_{\text{late}}{=}80$ ).

[TABLE]

E.1.3 Area Under Margin (AUM)

AUM (pleiss2020_aum) scores each training sample by aggregating its margin over training (e.g., logit of the true class minus the max non-true logit), producing the area under the margin across epochs/updates. Higher absolute AUM indicates more consistently confident predictions; lower AUM can flag ambiguous or noisy samples. We compute AUM with a proxy model and then train the large model on the selected coreset.

Execution. One-time selection with a proxy; the large model then trains on the coreset for all $T$ epochs:

i) Train proxy & log margins: train a proxy for $T_{\text{proxy}}$ epochs on all $N$ samples (per-example forward cost $f$, parameters $p$), recording each sample’s margin as it appears in training (no extra forward/backward beyond training).

ii) Compute AUM & select: for each sample, aggregate (e.g., sum/average) its logged margins to obtain AUM and select a size-$k$ coreset according to the desired criterion (e.g., highest AUM, or filter low-AUM points).

iii) Train large on coreset: train the large model on the selected $k$ samples for $T$ epochs.
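Steps (i) and (ii) can be streamed with a running sum and count per sample, as in the sketch below. The class name `AumTracker` and the choice of averaging (rather than summing) the margins are assumptions made for the example.

```python
def margin(logits, label):
    """Logit of the true class minus the largest other logit."""
    other = max(v for c, v in enumerate(logits) if c != label)
    return logits[label] - other

class AumTracker:
    """Running sum and count of margins for each training sample."""
    def __init__(self, n):
        self.total = [0.0] * n
        self.count = [0] * n

    def update(self, idx, logits, label):
        # Called whenever sample idx appears in a training batch.
        self.total[idx] += margin(logits, label)
        self.count[idx] += 1

    def aum(self):
        # Average margin per sample; 0.0 for samples never seen.
        return [t / c if c else 0.0 for t, c in zip(self.total, self.count)]
```

Because only two scalars are kept per sample, the storage overhead is linear in $N$ and independent of the number of epochs.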

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. AUM can be streamed with a running sum/count per sample:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{proxy}}{=}90$ , $T{=}90$ ).

[TABLE]

E.1.4 Contrastive Active Learning (Cal)

Cal (cal_margatina2021) acquires unlabeled examples that are near labeled ones in feature space yet differ in predictive probabilities (contrastive pairs), using nearest neighbors over encoder features and a simple divergence-based ranking.

Execution. A one-time selection stage is performed prior to training the large model:

i) train a proxy model from scratch on the (growing) labeled set up to size $k$ (per-example forward cost $f\ll f_{\text{large}}$);

ii) encode all $U$ unlabeled points with the trained proxy (here $U{=}N$);

iii) run $k$NN-style neighbor search between the unlabeled pool and the labeled set (size $k$), and select a coreset of size $k$.

(The divergence computation is much cheaper than (ii)–(iii) and is absorbed into big- $\mathcal{O}$ .)

Overall compute (select once, then train-on-coreset).

[TABLE]

Selection-stage storage overhead. Storage overhead is due to the cached features for the unlabeled pool:

[TABLE]

Example scenario values (ImageNet-1k; $U{=}N$).

[TABLE]

E.1.5 Gradient and Error L2 Norm-based Data Pruning (GraNd, EL2N)

GraNd (E2LN_grand_paul2021) ranks training examples by the (expected) per-example gradient norm early in training:

[TABLE]

averaged over multiple random initializations and early epochs, then retains the top- $k$ examples.

EL2N (E2LN_grand_paul2021) ranks examples by the (expected) L2 error of predictions early in training:

[TABLE]

where $\mathbf{p}_{\theta}$ are class probabilities and $\mathbf{y}_{m}$ is the one-hot label. Scores are computed at small $t$ and optionally averaged over $R$ runs.
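For concreteness, the EL2N score of a single example can be computed from logged logits as in the following sketch; GraNd is omitted because it additionally needs per-sample gradients. The function names are hypothetical, and the softmax over raw logits is an assumption about how the class probabilities are obtained.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def el2n(logits, label):
    """L2 norm of (softmax probabilities - one-hot label) for one example."""
    p = softmax(logits)
    return math.sqrt(sum((pc - (1.0 if c == label else 0.0)) ** 2
                         for c, pc in enumerate(p)))
```

A confidently correct prediction gives a score near $0$, while a confidently wrong one approaches $\sqrt{2}$; averaging over runs and early epochs would be done outside this function.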

Execution. One-time selection prior to training the large model. Let $T_{\text{early}}$ be the number of early epochs used for scoring and $T_{\text{late}}\!=\!T-T_{\text{early}}$ the remaining epochs used to train on the coreset:

i) For each of $R$ initializations, train on all $N$ points for $T_{\text{early}}$ epochs (costing $3NT_{\text{early}}f_{\text{large}}$) while logging per-example predictions.

(a) EL2N: compute scores directly from the logged predictions (no extra passes).

(b) GraNd: run an additional scoring sweep to obtain per-sample gradients (one forward + one backward pass per example per early epoch).

ii) Average scores across runs; keep the top $k$; train the large model on the coreset for $T_{\text{late}}$ epochs.

End-to-end compute (selection + train-on-coreset).

[TABLE]

Selection-stage storage overhead. During scoring, a running vector of $N$ scalar scores needs to be stored:

[TABLE]

Example scenario values (ImageNet-1k; $R{=}10$ , $T_{\text{early}}{=}10$ , $T_{\text{late}}{=}80$ ).

[TABLE]

E.1.6 Using Class Feature Medians (Moderate)

Moderate (xia2022moderate) builds a representative coreset by selecting, within each class, the samples whose feature-to-center distances are closest to that class’s median distance (thus avoiding both easy near-center redundancies and far-out outliers). We compute class centers and distances in a proxy feature space.

Execution. One-time selection with a proxy, then train the large model on the coreset for all $T$ epochs:

i) Train proxy: train a proxy encoder on all $N$ samples for $T_{\text{proxy}}$ epochs (per-example forward cost $f$ , parameters $p$ ).

ii) Encode dataset: extract proxy features for all $N$ samples (cost $Nf$ FLOPs).

iii) Class-median selection: for each class, compute the class center and all sample distances, then select the per-class quota of samples whose distances are closest to the class-wise median (distance computation $\mathcal{O}(Nd)$ ; median/quantile selection $\mathcal{O}(N\log N)$ or linear-time selection).

iv) Train large on coreset: train the large model on the selected $k$ samples for $T$ epochs.
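The class-median selection step can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the class center is taken to be the mean feature vector, distances are Euclidean, and the per-class quota is a fixed fraction of the class size; the function name and arguments are hypothetical.

```python
import numpy as np

def moderate_select(features, labels, keep_fraction):
    """Per class, keep the samples whose distance to the class center is
    closest to that class's median distance.

    features: (N, d) proxy features
    labels:   (N,) integer class labels
    keep_fraction: fraction of each class to retain (assumed uniform quota)
    """
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        center = features[idx].mean(axis=0)                  # class center
        dist = np.linalg.norm(features[idx] - center, axis=1)
        gap = np.abs(dist - np.median(dist))                 # distance to the median band
        quota = max(1, int(round(keep_fraction * len(idx))))
        selected.append(idx[np.argsort(gap, kind="stable")[:quota]])
    return np.sort(np.concatenate(selected))
```

Sorting by the gap to the median costs $\mathcal{O}(N\log N)$ overall, matching the sort-based variant mentioned in step (iii); a partial selection such as `np.argpartition` would give the linear-time alternative.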

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. The main storage overhead is from caching the features during selection:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{proxy}}{=}90$ , $T{=}90$ ).

[TABLE]

E.1.7 Message Passing ( $\mathbb{D}^{2}-\mathtt{Pruning}$ )

$\mathbb{D}^{2}\mathtt{Pruning}$ (modelpred_d2_maharana2023) selects a coreset by balancing difficulty and diversity via message passing on a dataset graph built from proxy features. Initial per-sample difficulty scores (from the proxy) are diffused over a $k$ NN graph so that each example’s score incorporates information from its neighbors; a graph-based sampler then selects a subset that covers diverse yet difficult regions.

Execution. One-time selection with a proxy, then train the large model on the coreset for all $T$ epochs:

i) Train proxy: train a proxy encoder on all $N$ samples for $T_{\text{proxy}}$ epochs (per-example forward cost $f$ , parameters $p$ ).

ii) Encode dataset: extract proxy features for all $N$ samples (cost $Nf$ FLOPs).

iii) Build graph: construct a $k$NN graph over the features (e.g., ANN); cost $\mathcal{O}(N\,k\,d)$ (or $\mathcal{O}(N\,d\log N)$ ).

iv) Message passing & sampling: run $H$ rounds of (forward/reverse) message passing on the $N\!\times\!k$ edges to update difficulty-aware scores, then sample a size- $k$ coreset (cost $\mathcal{O}(H\,N\,k)$ plus linear-time sampling).

v) Train large on coreset: train the large model on the selected $k$ samples for $T$ epochs.

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. Caching proxy features dominates; the $k$ NN adjacency is linear in $N$ and smaller in practice:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{proxy}}{=}90$ , $T{=}90$ ).

[TABLE]

E.2 Optimization-based Methods

E.2.1 Gradient Matching Optimization (CRAIG)

CRAIG (craig_mirzasoleiman2020) selects a subset whose (aggregated) gradients closely match those of the full dataset across training, typically using a submodular (stochastic-greedy) objective over per-sample gradient embeddings at selected anchor epochs. We compute embeddings with a proxy model and then train the large model on the coreset for all $T$ epochs.

Execution. One-time selection with a proxy, then train the large model:

i) Train proxy: train a proxy network on all $N$ samples for $T_{\text{proxy}}$ epochs (per-example forward cost $f$ , parameters $p$ ).

ii) Per-sample gradient embeddings at anchors: every $\gamma_{\text{anc}}$ epochs (anchors $A{=}T_{\text{proxy}}/\gamma_{\text{anc}}$ ), compute for each sample the last-layer gradient embedding using only forward-pass outputs:

[TABLE]

where $h_{\theta}$ is the penultimate representation (dim. $F$ ) and $\mathbf{p}_{\theta}-\mathbf{y}_{i}\in\mathbb{R}^{C}$ is the class-probability error; this avoids backward passes (piggybacks on training). Selection then runs stochastic-greedy on these embeddings per anchor.

iii) Select & train large: union the anchor-wise selections to a size- $k$ coreset and train the large model on it for all $T$ epochs.
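A rough sketch of the per-anchor selection is given below. Several choices here are assumptions rather than details confirmed by the text: the embedding is formed by concatenating $h_{\theta}$ with $\mathbf{p}_{\theta}-\mathbf{y}_{i}$ (one reading consistent with an embedding of dimension about $F{+}C$), the submodular objective is taken to be facility location, and plain greedy is used in place of stochastic-greedy for brevity. All function names are hypothetical.

```python
import numpy as np

def last_layer_embedding(penultimate, probs, labels):
    """Assumed embedding: [h_theta(x), p_theta(x) - y], shape (N, F + C)."""
    one_hot = np.eye(probs.shape[1])[labels]
    return np.concatenate([penultimate, probs - one_hot], axis=1)

def greedy_facility_location(emb, k):
    """Plain greedy maximization of sum_i max_{j in S} sim(i, j)."""
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)  # (N, N)
    sim = dist.max() - dist              # non-negative similarity
    best = np.zeros(len(emb))            # coverage of each point by the current set
    chosen = []
    for _ in range(k):
        # Objective value if candidate j were added, minus the current value.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf          # never re-select a chosen point
        j = int(np.argmax(gains))
        chosen.append(j)
        best = np.maximum(best, sim[:, j])
    return chosen
```

The dense $N\times N$ similarity matrix is only workable at toy scale; it is used here to keep the marginal-gain computation explicit.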

End-to-end compute (selection + coreset training).

[TABLE]

Here $D_{\text{eff}}\approx F{+}C$ is the embedding dimensionality (penultimate features and class-probability error); the submodular arithmetic is negligible in FLOPs relative to training and is kept in big- $\mathcal{O}$ .

Selection-stage storage overhead. We stream anchor processing so only a single anchor’s embeddings need be cached at once:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{proxy}}{=}90$ , $T{=}90$ , $F{=}512$ , $C{=}1000$ ).

[TABLE]

E.2.2 Generalization-based Data Subset Selection for Efficient and Robust Learning (Glister)

Glister (glister_killamsetty2021) selects ${\mathbf{S}}_{j}$ of size $k$ via a mixed discrete–continuous bi-level objective:

[TABLE]

Execution. Glister replaces the training loop: it interleaves training on the current subset with periodic re-selection every $\gamma$ epochs (using stochastic-greedy with a Taylor approximation).

Overall compute (train-on-coreset). Training on the coreset over $T$ epochs costs $3kT\,f_{\text{large}}$ . Selection across $T$ epochs at frequency $\gamma$ costs $\mathcal{O}\!\left(\frac{\big(kQ+N\log(1/\epsilon)\big)\,f_{\text{large}}\,T}{\gamma}\right).$ Hence

[TABLE]

Selection-stage storage overhead. Storage overhead is from validation caches:

[TABLE]

Example scenario values:

[TABLE]

E.2.3 GraphCut-based Data Subset Selection (GraphCut)

GraphCut (graphcut_iyer2021) selects ${\mathbf{S}}_{j}$ via the generalized graph-cut function

[TABLE]

Execution. GraphCut adds a one-time selection stage prior to training (it does not replace training). The procedure is:

i) train a proxy model;

ii) extract features for all $N$ training points using the trained proxy (per-example cost $f\ll f_{\text{large}}$ );

iii) run (stochastic-)greedy selection to build a size- $k$ subset.

(Similarity operations are typically much cheaper than (ii)–(iii), so we absorb them into big- $\mathcal{O}$ .)
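The greedy selection in step (iii) can be illustrated as follows. The sketch assumes the commonly used form of the generalized graph-cut function, $f({\mathbf{S}})=\sum_{i\in V}\sum_{j\in{\mathbf{S}}}s_{ij}-\lambda\sum_{i,j\in{\mathbf{S}}}s_{ij}$, over a precomputed similarity matrix, and uses plain greedy rather than stochastic-greedy; the function names and the default $\lambda$ are hypothetical.

```python
import numpy as np

def graph_cut_value(sim, subset, lam):
    """f(S) = sum_{i in V, j in S} s_ij - lam * sum_{i, j in S} s_ij."""
    subset = list(subset)
    if not subset:
        return 0.0
    coverage = sim[:, subset].sum()
    redundancy = sim[np.ix_(subset, subset)].sum()
    return float(coverage - lam * redundancy)

def greedy_graph_cut(sim, k, lam=0.4):
    """Add, k times, the element with the largest marginal gain in f."""
    chosen = []
    for _ in range(k):
        base = graph_cut_value(sim, chosen, lam)
        best_j, best_gain = None, -np.inf
        for j in range(len(sim)):
            if j in chosen:
                continue
            gain = graph_cut_value(sim, chosen + [j], lam) - base
            if gain > best_gain:
                best_j, best_gain = j, gain
        chosen.append(best_j)
    return chosen
```

Here $\lambda$ trades off how well the subset represents the full set against the similarity accumulated inside the subset. Recomputing $f$ from scratch for every candidate is wasteful and is done only to keep the marginal gain explicit.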

Overall compute (train-on-coreset). One-time selection (including proxy training) + large-model training:

[TABLE]

Selection-stage storage overhead. Storage overhead is from storing pairwise similarities:

[TABLE]

Example scenario values:

[TABLE]

E.2.4 Reconstructing the Decision Boundary (BoundarySet-CCS)

BoundarySet-CCS (mindboundary_yang2024) selects samples near the model’s decision boundary and then enforces coverage across distance bands. Distance-to-boundary is approximated per sample by the minimum number of PGD steps required to flip its prediction; CCS (coverage-centric sampling) then allocates the coreset budget across bands to preserve distribution coverage.

Execution. One-time selection with a proxy, then train the large model on the coreset for all $T$ epochs:

i) Train proxy: train a proxy network on all $N$ samples for $T_{\text{proxy}}$ epochs (per-example forward cost $f$ , parameters $p$ ).

ii) Distance-to-boundary (PGD): for each sample, run projected gradient steps until misclassification (cap at $K_{\max}$ steps). If the stopping step is $k$ , define $d(x)=k$ . Each PGD step requires one forward and one backward pass; we use $3f$ per step in our convention. Let $\bar{K}\leq K_{\max}$ be the average number of steps per sample.

iii) CCS selection: partition samples by $d(x)\in\{0,\dots,K_{\max}\}$ and allocate the size- $k$ budget across bands (linear-time bucketing and sampling).

iv) Train large on coreset: train the large model on the selected $k$ points for all $T$ epochs.

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. Storage is dominated by the scalar distance kept per sample during selection:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{proxy}}{=}90$ , $K_{\max}{=}50$ so $\bar{K}{\approx}50$ , $T{=}90$ ).

[TABLE]

E.3 Training Property-based Methods

E.3.1 Samples with Low Loss Curvature (SloCurv)

SloCurv (slocurves_garg2023) scores each training sample by an input-loss curvature proxy computed at the end of (proxy) training. For a sample $\vec{z}_{m}$ , with model parameters $\theta^{T}$ and random Rademacher directions $v_{r}$ scaled by $h$ , the score is

[TABLE]

Samples with the lowest curvature are retained to form a size- $k$ coreset.

Execution. One-time selection prior to training the large model; a proxy model is used for scoring:

i) Train proxy: train a proxy encoder with per-example forward FLOPs $f$ and parameters $p$ for $T_{\text{proxy}}$ epochs on all $N$ points.

ii) Curvature scoring: at the end of proxy training, for each sample compute $\mathrm{Curv}(\vec{z}_{m};\theta^{T})$ using $R$ Hutchinson repeats. This requires $(R{+}1)$ gradient evaluations per sample (one at $\vec{z}_{m}$ and one for each $\vec{z}_{m}{+}hv_{r}$ ), each costing $\approx(1\text{ fwd }{+}\;1\text{ bwd})\approx 3f$ in our convention.

iii) Train on coreset: select the $k$ lowest-curvature samples and train the large model on this coreset for $T_{\text{late}}{=}T$ epochs.
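The curvature-scoring step can be sketched with finite differences of input gradients. The snippet below is an illustration only: it assumes the score averages $\lVert\nabla_{x}\ell(x{+}hv_{r})-\nabla_{x}\ell(x)\rVert_{2}^{2}$ over $R$ Rademacher directions, and it takes the input-gradient routine as a user-supplied callable (`grad_fn`, a hypothetical name) so that the example stays framework-agnostic.

```python
import numpy as np

def curvature_proxy(grad_fn, x, h=1e-3, repeats=10, rng=None):
    """Average squared change of the input gradient along random Rademacher
    directions; uses (repeats + 1) gradient evaluations for one sample.

    grad_fn: callable mapping an input array to d(loss)/d(input)
    x:       input sample (any array shape)
    h:       perturbation scale
    """
    rng = np.random.default_rng(0) if rng is None else rng
    g0 = grad_fn(x)                                  # one gradient at the sample
    total = 0.0
    for _ in range(repeats):
        v = rng.choice([-1.0, 1.0], size=x.shape)    # Rademacher direction
        total += np.sum((grad_fn(x + h * v) - g0) ** 2)
    return total / repeats

def lowest_curvature(scores, k):
    """Indices of the k samples with the smallest curvature score."""
    return np.argsort(scores, kind="stable")[:k]
```

For a quadratic loss $\ell(x)=\tfrac{1}{2}\sum_{i}a_{i}x_{i}^{2}$ the input gradient is $a\odot x$, so every Rademacher direction gives exactly $h^{2}\sum_{i}a_{i}^{2}$, which makes a convenient sanity check for the routine.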

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. Storage overhead is from keeping track of the running curvature values and the directions probed:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{proxy}}{=}90$ , $T_{\text{late}}{=}90$ , $R{=}10$ ).

[TABLE]

E.3.2 Temporal Dual-Depth Scoring (TDDS)

TDDS (zhang2024_tdds) builds a coreset by combining two temporal depths of signal from training with a proxy model. Depth 1 computes, for each epoch, the projection of each sample’s per-sample gradient onto the epoch’s accumulated gradient direction. Depth 2 then aggregates these per-epoch contributions over a sliding window of length $J$ and emphasizes their temporal variability (e.g., windowed variance). We maintain windowed statistics in a streaming manner (constant-time updates), so full trajectories need not be stored.

Execution. One-time selection with a proxy; the large model then trains on the coreset for the full $T$ epochs:

i) Train proxy: train a proxy for $T_{\text{proxy}}$ epochs on all $N$ samples (per-example forward FLOPs $f$ , parameters $p$ ); accumulate the epoch gradient direction.

ii) Per-sample gradients: after each proxy epoch, run a scoring sweep to compute per-sample gradients and their projections onto the epoch direction (costing one forward+backward per sample); update the $J$ -length windowed statistics and TDDS score (streaming).

iii) Select & train large: rank by TDDS and keep the top- $k$ ; train the large model on these $k$ samples for all $T$ epochs.

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. Streaming TDDS requires a $J$ -length buffer of scalar contributions per example (and a temporary epoch-direction vector):

[TABLE]

Example scenario values (ImageNet-1k; $T{=}90$ , $T_{\text{proxy}}{=}90$ , $J{=}10$ ).

[TABLE]

E.3.3 Using Prediction Uncertainty with a Proxy (Dyn-Unc, DUAL)

Dyn-Unc (uncertainity_he2024_dynunc) measures prediction uncertainty via a sliding window of length $J$ over per-example target-class probabilities and averages the windowed uncertainty across proxy training.

DUAL (cho2025_DUAL) combines uncertainty with difficulty (window-mean prediction) and computes scores from an early stage of proxy training.

Execution. One-time selection with a proxy; the large model then trains on the coreset for the full $T$ epochs:

i) Train proxy: train a proxy with per-example forward FLOPs $f$ and parameters $p$ for $T_{\text{proxy}}$ epochs; maintain sliding-window statistics (length $J$ ) via $O(1)$ updates per visit.

ii) Score & select:

• Dyn-Unc: use windowed uncertainty (variance over the last $J$ predictions), averaged over all proxy epochs $T_{\text{proxy}}$ .

• DUAL: use the product of windowed uncertainty and difficulty (window mean), averaged over the early proxy epochs $T_{\text{proxy,early}}\leq T_{\text{proxy}}$ .

• Beta sampling (DUAL): apply pruning-ratio–adaptive sampling based on a Beta distribution to stabilize extreme pruning. This adds negligible compute and storage.

iii) Train on coreset (large model): train for all $T$ epochs on the top- $k$ points.
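The windowed scores in step (ii) can be sketched from a logged matrix of target-class probabilities. The snippet is illustrative and makes two assumptions that should be checked against the original methods: uncertainty is the population variance within each window (as described above), and DUAL's difficulty term is taken to be one minus the window-mean target probability. It also materializes all windows at once rather than using the streaming $O(1)$ updates, and omits the Beta sampling step. Function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windowed_stats(target_probs, window):
    """target_probs: (T, N) target-class probability of each sample per epoch.
    Returns per-window mean and variance, each of shape (T - window + 1, N)."""
    win = sliding_window_view(target_probs, window, axis=0)   # (T - J + 1, N, J)
    return win.mean(axis=-1), win.var(axis=-1)

def dyn_unc_scores(target_probs, window):
    """Windowed variance averaged over all proxy epochs."""
    _, var = windowed_stats(target_probs, window)
    return var.mean(axis=0)

def dual_scores(target_probs, window, early_epochs):
    """Uncertainty times difficulty, using only the first `early_epochs` epochs.
    Difficulty = 1 - window mean is an assumption of this sketch."""
    mean, var = windowed_stats(target_probs[:early_epochs], window)
    return (var * (1.0 - mean)).mean(axis=0)
```

A sample whose target probability never moves has zero score under both definitions, while a sample that is still being learned inside the window scores higher.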

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. Only a scalar window of values per example is required:

[TABLE]

Example scenario values (ImageNet-1k; $T{=}90$ , $T_{\text{proxy}}{=}90$ , $T_{\text{proxy,early}}{=}50$ , $J{=}10$ ).

[TABLE]

E.4 Our Method - Correlation of Loss Differences (CLD)

CLD builds a coreset by leveraging only loss values over training: it records the per-epoch losses of all training points and a small held-out query set, then ranks training examples using the correlation of loss differences across epochs between train and query. No gradients or Hessians are required.

Execution. One-time selection with a proxy, followed by large-model training:

i) Train proxy: train a proxy model for $T_{\text{proxy}}$ epochs on all $N$ samples (per-example forward cost $f$ , parameters $p$ ).

ii) Collect losses: during proxy training, record per-epoch losses for all $N$ training samples (no extra compute beyond the training pass), and run a forward pass on all $Q$ query samples each epoch to record their losses ( $Qf$ FLOPs per epoch).

iii) Score & select: compute CLD scores (correlations of loss differences over epochs) and select a size- $k$ coreset. The arithmetic for correlations/ranking is linear-time in the number of stored losses and is negligible compared to the FLOPs above.

iv) Train on coreset: train the large model on the selected $k$ points for $T$ epochs.
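The score-and-select step can be sketched directly from the two stored loss matrices. The snippet below is a minimal illustration, assuming the score of a training sample is the Pearson correlation between its epoch-to-epoch loss differences and the mean loss differences of the query set, and that the $k$ highest-scoring samples are kept; the per-class validation alignment discussed in the main text is omitted, and the function names are hypothetical.

```python
import numpy as np

def cld_scores(train_losses, query_losses):
    """Correlation of loss differences for every training sample.

    train_losses: (T, N) per-epoch losses of the N training samples
    query_losses: (T, Q) per-epoch losses of the Q query samples
    """
    d_train = np.diff(train_losses, axis=0)                 # (T - 1, N)
    d_query = np.diff(query_losses, axis=0).mean(axis=1)    # (T - 1,)
    a = d_train - d_train.mean(axis=0, keepdims=True)       # center each trajectory
    b = d_query - d_query.mean()
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b)
    return (a * b[:, None]).sum(axis=0) / np.maximum(denom, 1e-12)

def select_coreset(scores, k):
    """Indices of the k training samples with the highest score."""
    return np.argsort(-scores, kind="stable")[:k]
```

The cost is one pass over the $(N{+}Q)\times T_{\text{proxy}}$ stored scalars, which is why the arithmetic is negligible next to proxy training.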

End-to-end compute (selection + coreset training).

[TABLE]

Selection-stage storage overhead. We store loss scalars for all $N$ training and $Q$ query samples across $T_{\text{proxy}}$ epochs:

[TABLE]

Example scenario values (ImageNet-1k; $T_{\text{proxy}}{=}90$ , $T{=}90$ ).

[TABLE]

Appendix F Comparison of CLD with Influence

The impact measured by $\mathtt{CLD}$ closely aligns with the “influence” of individual training samples on a model’s predictions, as measured by Training Data Attribution (TDA) methods. TDA methods have been widely employed for tasks such as debugging datasets, interpreting models, and optimizing training efficiency (Koh2017; representer_yeh2018; FZ_infl_feldman2020).

The earliest TDA methods utilized Leave-One-Out (LOO) training, which involves retraining the model after removing specific data points and observing the changes in performance. While straightforward, LOO retraining is computationally prohibitive for modern deep learning models due to the need for multiple retraining cycles (Koh2017). Recent TDA metrics, such as FZ-Influence ( $\mathtt{Infl}$ ) (FZ_infl_feldman2020) and $\mathtt{Datamodels}$ (datamodels_ilyas2022), have gained popularity owing to precomputed scores for widely used datasets in computer vision. These methods, however, face scalability challenges.

A prominent alternative is Influence Functions, which estimate the effect of downweighting individual samples using first-order (gradient) and second-order (Hessian) computations performed at the end of training (Koh2017; influence_fragile_basu2021). Methods like $\mathtt{RandSelect}$ (Randselect_wojnowicz2016) and $\mathtt{Arnoldi}$ iterations (Arnoldi_schioppa2022) improved computational efficiency by approximating the Hessian. Similarly, $\mathtt{TRAK}$ (trak_2023park) combined random projections, gradient-based methods, and ensembling to estimate the influence of training samples. However, these approaches often rely on strong assumptions, such as convergence to a unique optimal solution, which limits their applicability to neural networks. Additionally, Hessian computations introduce significant computational overhead. To address these challenges, unrolling-based methods that observe the learning process across training iterations have been proposed. These techniques approximate the impact of samples by differentiating through the optimization trajectory (dataclensing_hara2019). Among these, $\mathtt{TracIn}$ (tracin_pruthi2020) is a highly efficient method that estimates influence using gradients tracked throughout training. Its practical implementation, $\mathtt{TracInCP}$ , uses intermediate checkpoints to alleviate computational burdens. While effective, unrolling methods require storing intermediate training states, leading to high storage and computational costs.

In contrast, $\mathtt{CLD}$ relies solely on loss trajectories rather than first- or second-order quantities (e.g., gradients and Hessians).

In order to measure the “influence” of a training sample $\vec{z}_{m}$ on an individual unseen (or query) sample $\vec{z}_{q}$ , we modify Definition 1 slightly to be

[TABLE]

We will now compare the impact measured by this metric ( $\mathtt{CLD}_{\texttt{infl}}$ ) to the influence measured by TDA metrics, by utilizing the linear datamodeling score ( $\mathtt{LDS}$ ) introduced by trak_2023park. $\mathtt{LDS}$ measures the correlation between group-level attribution scores ( $\mathtt{CLD}_{\texttt{infl}}$ or influence) and their observed impact on model predictions when subsets of training data are used.

$\mathtt{LDS}$ definition: For a query data point $\vec{z}_{q}$ , random subsets $\{\mathcal{S}_{j}\}_{j=1}^{C}$ are sampled from the training dataset, where each subset $\mathcal{S}_{j}$ contains $\lceil\alpha N\rceil$ points, with $\alpha\in(0,1)$ as the sampling ratio. Each subset $\mathcal{S}_{j}$ is used to retrain the model $R$ times with different initializations $\{\xi_{r}\}_{r=1}^{R}$ and training parameters $\lambda$ , resulting in the model $\theta^{T}_{\mathcal{S}_{j},\xi_{r}}$ . This trained model is then used to compute a measurable quantity $f(\vec{z}_{q},\theta^{T}_{\mathcal{S}_{j},\xi_{r}})$ . A group attribution score, $g_{\tau}(\vec{z}_{q},\mathcal{S}_{j},\mathcal{S})$ , is calculated as $g_{\tau}(\vec{z}_{q},\mathcal{S}_{j},\mathcal{S})\coloneqq\sum_{\vec{z}\in\mathcal{S}_{j}}\tau(\vec{z}_{q},\vec{z},\mathcal{S}),$ where $\tau(\vec{z}_{q},\vec{z},\mathcal{S})$ is the attribution score for a training point $\vec{z}$ with respect to $\vec{z}_{q}$ . The $\mathtt{LDS}$ is then obtained using Spearman’s rank correlation ( $\rho_{s}$ ) (spearman1904):

[TABLE]
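For a single query point, the computation can be sketched as follows. This is an illustration under stated assumptions: the measurable quantity is averaged over the $R$ seeds before ranking, subsets are encoded as a boolean membership matrix, and Spearman's correlation is computed from ranks without tie correction. The function names are hypothetical.

```python
import numpy as np

def ranks(x):
    """Ordinal ranks 0..n-1 (no tie correction)."""
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x))
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def lds(attributions, subset_masks, measured_outputs):
    """Linear datamodeling score for one query point.

    attributions:     (N,) attribution tau(z_q, z, S) of each training point
    subset_masks:     (C, N) boolean, row j marks the members of subset S_j
    measured_outputs: (C,) f(z_q, theta_{S_j}) averaged over the R seeds
    """
    group_scores = subset_masks.astype(float) @ attributions   # g_tau per subset
    return spearman(group_scores, measured_outputs)
```

A score near 1 means that subsets with larger summed attribution also yield larger measured outputs after retraining; averaging this quantity over the query set gives one point on the curve for a fixed $\alpha$.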

Experimental Setup: We compared the $\mathtt{LDS}$ scores of $\mathtt{CLD}_{\texttt{infl}}$ against those of $\mathtt{TRAK}$ , $\mathtt{Arnoldi}$ , $\mathtt{TracIn}$ , $\mathtt{Infl}$ , and $\mathtt{Datamodels}$ . Precomputed scores for $\mathtt{Infl}$ and $\mathtt{Datamodels}$ were used for the CIFAR-10 dataset (cifar) with ResNet-9 (resnet), while 10 models were trained for $\mathtt{TRAK}$ , $\mathtt{Arnoldi}$ , $\mathtt{TracIn}$ , and $\mathtt{CLD}_{\texttt{infl}}$ . The evaluation employed $C=100$ random subsets, sampling ratios $\alpha$ ranging from $0.3$ to $\frac{N-1}{N}$ , a query set of 200 samples, and $R=10$ seeds. The measurable quantity was the accuracy of query samples.

Results and Observations: The results presented in Figure 8(a) reveal that while the impact captured by $\mathtt{CLD}_{\texttt{infl}}$ is distinct from the influence measured by traditional TDA metrics, it aligns closely with methods such as $\mathtt{TracIn}$ and $\mathtt{TRAK}$ in terms of behavior while being resource-efficient. Notably, the performance gap between these computationally intensive methods and $\mathtt{CLD}_{\texttt{infl}}$ narrows as $\alpha$ increases. The drop in $\mathtt{LDS}$ scores at $\alpha=\frac{N-1}{N}$ is due to the stochastic nature of model retraining (this observation is consistent with the findings of previous research: revisitinglds_karthikeyan2021; source_bae2024).

Takeaways: Although $\mathtt{CLD}_{\texttt{infl}}$ (and in essence $\mathtt{CLD}$ ) fundamentally differs from influence-based TDA metrics, it mirrors their trends at higher sampling ratios while maintaining superior computational efficiency, solidifying its utility as a practical tool for analyzing training dynamics.

F.1 Importance of the Top-k Samples

We demonstrate that the samples identified by $\mathtt{CLD}$ are indeed pivotal for generalization, addressing the question: “Are the training samples with the top- $k$ scores truly the most critical for forming a coreset?” We evaluate this using the prediction brittleness metric, which is also mentioned briefly in Section 8.

Experimental Setup: To quantify the influence of the top- $k$ samples, we systematically removed the most impactful data points, as identified by their $\mathtt{CLD}$ scores, from the training set and retrained the model. The metric of interest was the fraction of prediction flips observed in a held-out query set after retraining. If these samples are truly critical for generalization, their removal should cause substantial prediction changes. This experiment also included a comparative analysis with the top- $k$ influential samples identified by the TDA scores discussed in this section. Experiments were performed on the CIFAR-10 dataset using a ResNet-9 architecture, with a randomly selected query set of 200 samples. For each configuration, once the top- $k$ samples were excluded, the model was retrained 5 times to account for randomness, and the average fraction of prediction flips was recorded.
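The bookkeeping behind this metric can be sketched as below. Model training is outside the scope of the snippet: it assumes the query-set predictions of the full-data model and of each retrained model are already available as integer arrays, and the function names are hypothetical.

```python
import numpy as np

def remove_top_k(scores, k):
    """Indices of the training samples that remain after dropping the k
    highest-scoring ones."""
    return np.argsort(-scores, kind="stable")[k:]

def flip_fraction(preds_full, preds_retrained):
    """Average fraction of query predictions that change after retraining.

    preds_full:      (Q,) predictions of the model trained on all data
    preds_retrained: (num_retrains, Q) predictions of the retrained models
    """
    flips = preds_retrained != preds_full[None, :]
    return float(flips.mean())
```

With 5 retrainings and 200 query samples, `preds_retrained` would have shape `(5, 200)`, and the returned value is the quantity plotted for each $k$.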

Results and Observations: The results, summarized in Figure 8(b), illustrate that the top- $k$ samples identified by $\mathtt{CLD}$ have a comparable influence on prediction outcomes to those identified by TDA-based metrics such as $\mathtt{TracIn}$ and $\mathtt{TRAK}$ . Notably, removing the top-800 samples of CIFAR-10, which constitute just 1.6% of the dataset, results in prediction flips for over half of the query set. This highlights the significant role of the samples identified by $\mathtt{CLD}$ in supporting model generalization. While metrics like $\mathtt{Datamodels}$ and $\mathtt{Infl}$ exhibit greater impact, they are computationally prohibitive, rendering them unsuitable for large-scale coreset generation.

Takeaways: $\mathtt{CLD}$ emerges as an effective and computationally efficient approach for identifying training samples critical to generalization, making it a practical tool for coreset selection in large-scale machine learning pipelines.


