GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization
Tianhao Tang, Haoyang Li, Lei Chen

TL;DR
GRACE is a framework that dynamically selects representative subsets of the training data for efficient large language model training, balancing informativeness against computational cost.
Contribution
It introduces a graph-guided, adaptive coreset selection method that updates the selected subset during training to improve the training efficiency and performance of LLMs.
Findings
GRACE reduces training costs significantly.
It maintains or improves downstream task performance.
The method scales effectively to large models.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their immense number of parameters and complex transformer-based architectures result in significant resource demands and computational complexity during training, making it challenging to optimize them efficiently on large datasets. To reduce training costs while preserving performance, researchers have investigated coreset selection techniques, which aim to identify small, representative subsets of the entire training dataset to accelerate LLM training. However, existing coreset selection methods fail to adapt to the dynamic nature of LLM training and often struggle with scalability for models of this size. To address these limitations, we propose a graph-guided adaptive and dynamic coreset selection framework for LLMs, namely GRACE. GRACE dynamically…
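To make the idea of coreset selection concrete, here is a minimal sketch of one classic static baseline, greedy k-center selection over example embeddings. It is not GRACE's algorithm: the abstract's description of the graph-guided, dynamic procedure is cut off above, so nothing here should be read as the authors' method. The function name `k_center_greedy`, the `budget` parameter, and the use of plain Euclidean distance on precomputed embeddings are all assumptions made for illustration.

```python
import math
import random


def k_center_greedy(points, budget, seed=0):
    """Return `budget` indices into `points` chosen by greedy k-center.

    Illustrative baseline only (hypothetical helper, not from the paper):
    each step adds the point that is farthest from the current selection,
    so the chosen subset spreads out to cover the whole dataset.
    """
    n = len(points)
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and len(points)")

    # Start from a random point; the seed keeps the run reproducible.
    first = random.Random(seed).randrange(n)
    selected = [first]

    # min_dist[i] = distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[first]) for p in points]

    while len(selected) < budget:
        # The least-covered point is the one farthest from the selection.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))

    return selected
```

A static selector like this is computed once before training. A dynamic scheme, in the sense the abstract describes, would instead refresh the subset as the model's view of the data changes during training, for example by re-running selection on updated representations at intervals; how GRACE does this is not stated in the text available here.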
