WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Benjamin Feuer, Chinmay Hegde

TL;DR
This paper introduces WILDCHAT-50M, the largest public chat dataset with responses from over 50 models, enabling large-scale analysis of synthetic data's role in language model post-training and demonstrating its effectiveness in improving model performance.
Contribution
The paper presents WILDCHAT-50M, a large public chat dataset for analyzing synthetic data in LLM post-training, and uses it to build RE-WILD, a public SFT mixture that outperforms the Tulu-3 SFT mixture.
Findings
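WILDCHAT-50M extends WildChat with responses from over 50 open-weight models, ranging from 0.5B to 104B parameters, which makes large-scale comparisons of data-generating models possible.
RE-WILD outperforms the Tulu-3 SFT mixture from Allen AI with only 40% as many samples.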
The dataset and code are publicly available for research use.
Abstract
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at…
Code & Models
- 🤗 penfever/oumi-l8b-ultrachat (model) · 7 dl
- 🤗 shisa-ai/shisa-v2-llama3.1-8b (model) · 58 dl · ♡ 2
- 🤗 shisa-ai/shisa-v2-qwen2.5-7b (model) · 339 dl · ♡ 12
- 🤗 shisa-ai/shisa-v2-unphi4-14b (model) · 9 dl · ♡ 5
- 🤗 shisa-ai/shisa-v2-mistral-nemo-12b (model) · 115 dl · ♡ 7
- 🤗 shisa-ai/shisa-v2-qwen2.5-32b (model) · 52 dl · ♡ 6
- 🤗 shisa-ai/shisa-v2-llama3.3-70b (model) · 10 dl · ♡ 5
- 🤗 shisa-ai/shisa-v2-llama3.1-405b (model) · 10 dl · ♡ 19
- 🤗 shisa-ai/shisa-v2-mistral-small-24b (model) · 72 dl · ♡ 3
- 🤗 DataPilot/ArrowIdeative-13b-NeoBase-ZERO-llm-jp-v0.1 (model) · 3 dl · ♡ 7
Taxonomy
Topics: Machine Learning in Healthcare
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Cosine Annealing · Adam · Softmax · Dropout
