Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Aladin Djuhera; Swanand Ravindra Kadhe; Syed Zawad; Farhan Ahmed; Heiko Ludwig; Holger Boche

arXiv:2506.06522·cs.CL·February 9, 2026

TL;DR

This paper systematically compares two open post-training datasets for large language models, analyzes their quality, and creates a new optimized dataset that improves model performance efficiently.

Contribution

It provides the first detailed comparison of open post-training datasets, introduces a new curation method, and releases resources for future research.

Findings

01. TuluTalk dataset outperforms source datasets on key benchmarks.

02. Structural and quality differences identified between Tulu-3-SFT-Mix and SmolTalk.

03. Curated dataset achieves similar or better performance with 14% fewer samples.

Abstract

Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work,…

Videos

Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance · SlidesLive

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Machine Learning in Materials Science