Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Jaehun Jung; Seungju Han; Ximing Lu; Skyler Hallinan; David Acuna; Shrimai Prabhumoye; Mostafa Patwary; Mohammad Shoeybi; Bryan Catanzaro; Yejin Choi

arXiv:2505.20161·cs.LG·May 27, 2025

TL;DR

This paper introduces a gradient-based diversity metric called G-Vendi and a synthetic data generation framework, Prismatic Synthesis, which together enhance the generalization of large language models, especially on out-of-distribution tasks.

Contribution

The paper proposes G-Vendi as a novel diversity metric and Prismatic Synthesis as a data augmentation method, significantly improving LLM reasoning generalization with less data.

Findings

1. G-Vendi correlates strongly with out-of-distribution performance (Spearman's ρ ≈ 0.9)

2. Prismatic Synthesis improves model performance across OOD benchmarks

3. A smaller model trained with synthetic data outperforms larger models trained on proprietary data
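
For readers unfamiliar with the statistic quoted in the first finding, the following is a minimal sketch of how a Spearman rank correlation is computed. The function names and the toy numbers are hypothetical and chosen only for illustration; they are not values from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical toy data: a diversity score and a benchmark score for five datasets.
diversity = [1.0, 2.0, 3.0, 4.0, 5.0]
ood_score = [1.0, 2.0, 3.0, 5.0, 4.0]  # one adjacent pair is out of order
rho = spearman_rho(diversity, ood_score)
```

Because the statistic depends only on rank order, a value near 1 means that ranking datasets by the diversity score almost reproduces their ranking by benchmark performance.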

Abstract

Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms…
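
The abstract describes G-Vendi only as the entropy of model-induced gradients. As a rough sketch of what such a score could look like, the code below computes a Vendi-style quantity: the exponentiated Shannon entropy of the eigenvalues of a normalized gradient similarity matrix. This is an assumption about the general form, not the paper's exact procedure. The function name is hypothetical, and the input is a plain array standing in for per-example gradients; how those gradients are extracted from the proxy model is not shown.

```python
import numpy as np


def gradient_vendi_score(grads: np.ndarray) -> float:
    """Vendi-style diversity of a set of gradient vectors (illustrative sketch).

    grads: array of shape (n_examples, n_dims), one gradient vector per example.
    Returns exp(entropy) of the eigenvalues of the normalized Gram matrix,
    which ranges from 1 (all gradients parallel) to n_examples (all orthogonal).
    """
    # L2-normalize each gradient so only its direction matters.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)

    # Cosine-similarity Gram matrix scaled by 1/n, so its eigenvalues sum to 1.
    n = unit.shape[0]
    gram = unit @ unit.T / n

    # Eigenvalues of a symmetric matrix; drop numerical zeros before the log.
    eigvals = np.linalg.eigvalsh(gram)
    eigvals = eigvals[eigvals > 1e-12]

    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Four mutually orthogonal gradients behave like four distinct "directions".
orthogonal_score = gradient_vendi_score(np.eye(4))
# Five identical gradients collapse to a single effective direction.
identical_score = gradient_vendi_score(np.tile([1.0, 0.0, 0.0], (5, 1)))
```

Read this way, the score acts as an effective count of distinct gradient directions in the dataset, which is one way to tie a diversity measure to model behavior rather than to surface features of the text.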

Code & Models

Models

Jaehun/PrismNLI-0.4B (Hugging Face model)

Taxonomy

Topics: Reservoir Engineering and Simulation Methods

Methods: Balanced Selection