Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures
Dang Nguyen, Wenhan Yang, Rathul Anand, Yu Yang, Baharan Mirzasoleiman

TL;DR
This paper introduces CoLM, a novel method for training large language models with small mini-batch coresets that match larger batch gradients, significantly reducing memory use while maintaining or improving performance.
Contribution
CoLM addresses the challenges of mini-batch coreset selection for LLMs by keeping representative examples of small data sources in each coreset, normalizing gradients for Adam, and sparsifying gradient matrices, enabling memory-efficient training.
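As a rough illustration of the last two ingredients, the sketch below shows one plausible way to normalize a per-example gradient with Adam-style moment estimates and then sparsify it. The function names `adam_normalize` and `sparsify_top_k`, the omission of bias correction, and the top-k magnitude rule are all assumptions made for illustration; they are not taken from the paper.

```python
import math

def adam_normalize(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return an Adam-style update direction for one per-example gradient.

    m and v are running first and second moment estimates.
    Bias correction is omitted for brevity (an assumption of this sketch).
    """
    out = []
    for g, m_i, v_i in zip(grad, m, v):
        m_new = beta1 * m_i + (1 - beta1) * g
        v_new = beta2 * v_i + (1 - beta2) * g * g
        out.append(m_new / (math.sqrt(v_new) + eps))
    return out

def sparsify_top_k(vec, k):
    """Keep the k largest-magnitude coordinates and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(vec)]

# Hypothetical usage: two coordinates with different raw scales.
normalized = adam_normalize([1.0, -2.0], m=[0.0, 0.0], v=[0.0, 0.0])
sparse = sparsify_top_k([0.1, -3.0, 2.0, 0.5], k=2)
```

With zero-initialized moments, the two coordinates end up with the same magnitude after normalization even though their raw gradients differ by a factor of two, which is why distances between raw gradients can be misleading when the optimizer is Adam.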
Findings
Reduces fine-tuning memory by 2x
Outperforms training with 4x larger mini-batches
Seamlessly integrates with existing methods like LoRA
Abstract
Training with larger mini-batches improves the convergence rate and can yield superior performance. However, training with large mini-batches becomes prohibitive for Large Language Models (LLMs), due to the large GPU memory requirement. To address this problem, an effective approach is finding small mini-batch coresets that closely match the gradient of larger mini-batches. However, this approach becomes infeasible and ineffective for LLMs, due to the highly imbalanced mixture of sources in language data, use of the Adam optimizer, and the very large gradient dimensionality of LLMs. In this work, we address the above challenges by proposing Coresets for Training LLMs (CoLM). First, we show that mini-batch coresets found by gradient matching do not contain representative examples of the small sources w.h.p., and thus including all examples of the small sources in the mini-batch coresets…
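The idea of a small coreset whose gradient matches that of a larger mini-batch, while keeping every example from small sources, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it swaps in a simple greedy mean-gradient matching rule, and the names `select_coreset`, `greedy_match`, and the threshold `small_source_max` are hypothetical.

```python
def mean_vec(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_match(grads, budget):
    """Greedily pick indices whose mean gradient best matches the full mean."""
    target = mean_vec(grads)
    chosen = []
    for _ in range(budget):
        best, best_err = None, float("inf")
        for i in range(len(grads)):
            if i in chosen:
                continue
            err = sq_dist(mean_vec([grads[j] for j in chosen + [i]]), target)
            if err < best_err:
                best, best_err = i, err
        chosen.append(best)
    return chosen

def select_coreset(grads, sources, budget, small_source_max=2):
    """Keep all examples of small sources, then gradient-match the rest."""
    counts = {}
    for s in sources:
        counts[s] = counts.get(s, 0) + 1
    kept = [i for i, s in enumerate(sources) if counts[s] <= small_source_max]
    rest = [i for i, s in enumerate(sources) if counts[s] > small_source_max]
    remaining = max(0, min(budget - len(kept), len(rest)))
    picked = greedy_match([grads[i] for i in rest], remaining) if remaining else []
    return sorted(kept + [rest[i] for i in picked])

# Hypothetical mini-batch: four examples from source "a", one from source "b".
grads = [[1.0, 0.0], [1.1, 0.0], [0.9, 0.0], [1.0, 0.1], [0.0, 5.0]]
sources = ["a", "a", "a", "a", "b"]
coreset = select_coreset(grads, sources, budget=3)
```

In this toy batch the single example from source "b" is always retained, and the remaining budget goes to the two examples from source "a" whose average gradient is closest to that source's full average.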
Peer Reviews
Decision·ICLR 2025 Poster
1. The approach takes a unique and innovative perspective on memory-efficient training. Unlike typical methods in the field, it focuses on data selection and reduces batch size to achieve memory-efficient training. 2. The paper is clear to understand and well motivated. 3. The paper presents comprehensive experiments on diverse datasets and models, substantiating the benefits of CoLM over various baselines.
1. Computational overhead: Figure 2(b) suggests that the computational overhead for CoLM is notably high, doubling the training time in some cases, whereas typical memory-efficient methods usually incur 5-20% overhead. A comparison of "memory vs. computational overhead vs. performance" with other methods would be beneficial. 2. The paper mainly evaluates CoLM in a fine-tuning context, where model and optimizer state memory are the main bottlenecks. It would be valuable to see CoLM tested in pre-
- The presentation is clear, and the paper is easy to follow, with only a few minor typos. - The proposed method, CoLM, is straightforward and demonstrates strong empirical performance.
- **Limited Base Models**: While the authors mention using Phi-2, Phi-3, and Zephyr, the main results in Tables 1 and 7 only report Phi-2. To more fully demonstrate CoLM’s performance across various models, I recommend including Phi-3 in these tables and considering adding state-of-the-art models, such as LLaMA-3 8B and LLaMA-3.1. - **Insufficient Discussion of Related Work and Novelty**: One of CoLM’s contributions is gradient normalization for the Adam optimizer, which improves mini-batch approx
- The paper is very well written. It provides clear motivation for the problem and is thoughtfully structured around the three key difficulties of applying coreset finding methods to LLM fine-tuning. This structure makes the paper clear and very easy to follow. - The authors provide extensive ablation analyses about the great majority (if not all) of the improvements they propose, which is greatly appreciated and helps show they are worthwhile. - The authors empirically demonstrate th
- Using gradient accumulation + LoRA trivially reduces the memory overhead. The paper would benefit from experiments making an explicit comparison to this case. It seems from Fig 2 (a) that you should be able to claim CoLM is still faster in this case, but not showing it and claiming memory efficiency makes the paper seem weak. - While the memory overhead is reduced due to the smaller batch size, there is no figure showing how much the total memory decreased by making the batch size smaller whe
Taxonomy
Topics: Innovative Microfluidic and Catalytic Techniques Innovation
Methods: Adam · Coresets · Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
