Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation

Madhav Kotecha; Vijendra Kumar Vaishya; Smita Gautam; Suraj Racha

arXiv:2505.01523 · cs.LG · May 6, 2025

Open Access

TL;DR

This paper introduces a utility-diversity balanced subset selection method for fine-tuning large language models on mathematical data, reducing training costs while maintaining high performance.

Contribution

It presents a novel subset selection approach combining utility and diversity metrics to efficiently fine-tune LLMs for mathematical domains.

Findings

01 Achieves near-full dataset performance with fewer training examples

02 Reduces computational cost and training time significantly

03 Outperforms baseline subset selection methods

Abstract

We propose a refined approach to efficiently fine-tuning large language models (LLMs) on specific domains, such as mathematics, using a budgeted subset selection method. Our approach combines utility and diversity metrics to select the most informative and representative training examples. The goal is to achieve near-full-dataset performance with a carefully selected subset of the data while significantly reducing computational cost and training time. The utility metric incorporates both perplexity and Chain-of-Thought (CoT) loss to identify challenging examples that contribute most to model learning, while the diversity metric ensures broad coverage across mathematical subdomains. We evaluate our method on LLaMA-3 8B and Phi-3 models, comparing against several baseline approaches, including…
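
The selection idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give the scoring formula, so the weighted sum in `utility`, the per-subdomain diversity bonus, the greedy loop, and the names `alpha`, `lam`, and `select_subset` are all assumptions made for illustration. Each example is assumed to carry a precomputed perplexity, a CoT loss, and a subdomain label.

```python
import math


def utility(perplexity, cot_loss, alpha=0.5):
    # Assumed form: a weighted sum of log-perplexity and CoT loss.
    # Higher values mark examples the model finds harder.
    return alpha * math.log(perplexity) + (1 - alpha) * cot_loss


def select_subset(examples, budget, lam=1.0, alpha=0.5):
    """Greedily pick up to `budget` example indices.

    Each step takes the example with the highest utility plus a
    diversity bonus that shrinks as its subdomain gets more picks.
    This is a hypothetical scoring rule, not the authors' method.
    """
    selected = []
    picks_per_subdomain = {}
    remaining = list(range(len(examples)))

    def gain(i):
        ex = examples[i]
        already = picks_per_subdomain.get(ex["subdomain"], 0)
        bonus = 1.0 / (1 + already)
        return utility(ex["perplexity"], ex["cot_loss"], alpha) + lam * bonus

    while remaining and len(selected) < budget:
        best = max(remaining, key=gain)
        selected.append(best)
        sub = examples[best]["subdomain"]
        picks_per_subdomain[sub] = picks_per_subdomain.get(sub, 0) + 1
        remaining.remove(best)
    return selected


# Toy data with made-up scores, for illustration only.
toy = [
    {"subdomain": "algebra", "perplexity": math.exp(2.0), "cot_loss": 2.0},
    {"subdomain": "algebra", "perplexity": math.exp(1.8), "cot_loss": 1.8},
    {"subdomain": "geometry", "perplexity": math.exp(1.0), "cot_loss": 1.0},
    {"subdomain": "geometry", "perplexity": math.exp(0.5), "cot_loss": 0.5},
]
utility_only = select_subset(toy, budget=2, lam=0.0)
balanced = select_subset(toy, budget=2, lam=2.0)
```

With `lam=0.0` the toy run keeps the two hardest examples, both from algebra; with `lam=2.0` the second pick shifts to geometry, which is the utility-versus-coverage trade-off the abstract describes.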

Taxonomy

Topics: Neural Networks and Applications