Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostafa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi

TL;DR
This paper introduces a gradient-based diversity metric called G-Vendi and a synthetic data generation framework, Prismatic Synthesis, which together enhance the generalization of large language models, especially on out-of-distribution tasks.
Contribution
The paper proposes G-Vendi as a novel diversity metric and Prismatic Synthesis as a data augmentation method, significantly improving LLM reasoning generalization with less data.
Findings
G-Vendi correlates strongly with out-of-distribution performance (Spearman's ρ ≈ 0.9)
Prismatic Synthesis improves model performance across OOD benchmarks
A smaller model trained with synthetic data outperforms larger models trained on proprietary data
Abstract
Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms…
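The abstract describes G-Vendi only as "the entropy of model-induced gradients", so the sketch below is an illustration under stated assumptions, not the authors' implementation. It assumes a Vendi-Score-style formulation: the exponentiated Shannon entropy of the eigenvalues of a normalized similarity matrix, computed here over per-sample gradient vectors. The function name `gradient_vendi_score` and the input array are hypothetical; in practice the rows would be gradient features taken from a small proxy model, which this sketch does not compute.

```python
import numpy as np


def gradient_vendi_score(grad_features: np.ndarray) -> float:
    """Exponentiated entropy of the eigenvalue spectrum of a gradient Gram matrix.

    grad_features: array of shape (n_samples, dim), one (assumed precomputed)
    gradient vector per training sample. This formulation is an assumption
    modeled on the Vendi Score, not taken from the paper's text.
    """
    # L2-normalize each row so every sample has unit self-similarity.
    norms = np.linalg.norm(grad_features, axis=1, keepdims=True)
    unit = grad_features / np.clip(norms, 1e-12, None)

    # Cosine-similarity Gram matrix scaled by 1/n, so its eigenvalues sum to 1.
    n = unit.shape[0]
    gram = unit @ unit.T / n

    # The Gram matrix is symmetric, so eigvalsh is appropriate.
    eigvals = np.linalg.eigvalsh(gram)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerically zero eigenvalues

    # Shannon entropy of the spectrum, then exponentiate.
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Four mutually orthogonal "gradients": maximally diverse for four samples.
diverse = gradient_vendi_score(np.eye(4))

# Four identical "gradients": no diversity at all.
redundant = gradient_vendi_score(np.ones((4, 3)))
```

Under this formulation the score behaves like an effective number of distinct samples: it equals the sample count when all gradient directions are orthogonal and drops to 1 when every sample yields the same direction.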
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
Methods: Balanced Selection
