WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Benjamin Feuer; Chinmay Hegde

arXiv:2501.18511·cs.LG·May 26, 2025

TL;DR

This paper introduces WILDCHAT-50M, the largest public chat dataset to date, which extends WildChat with responses from over 50 open-weight models. The dataset enables large-scale comparative analysis of synthetic data in language model post-training, and the authors use it to build RE-WILD, an SFT mix that outperforms the Tulu-3 SFT mixture with only 40% as many samples.

Contribution

The paper presents WILDCHAT-50M, a dataset for comparing synthetic data generating models in LLM post-training, and uses it to build RE-WILD, a public SFT mixture.

Findings

01

WILDCHAT-50M extends WildChat with responses from over 50 open-weight models, ranging in size from 0.5B to 104B parameters.

02

RE-WILD outperforms the Tulu-3 SFT mixture from Allen AI with only 40% as many samples.

03

The dataset and code are publicly available for research use.

Abstract

Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at…

Code & Models

Repositories

penfever/wildchat-50m (Official)

Taxonomy

Topics: Machine Learning in Healthcare

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Cosine Annealing · Adam · Softmax · Dropout