Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?

Aryan Sajith; Krishna Chaitanya Rao Kathala

arXiv:2411.15821·cs.CL·November 11, 2025

TL;DR

This paper empirically demonstrates that training data quality has a more substantial impact on small language model performance than data quantity, with implications for sustainable and accessible AI development.

Contribution

It provides a systematic analysis of how data quality and duplication affect small language model performance, highlighting the importance of data quality over quantity.

Findings

01

Training data quality influences small language model performance more than data quantity.

02

Minimal duplication (25%) slightly improves accuracy (+0.87%) without significantly increasing perplexity (+0.52%).

03

Excessive duplication severely degrades performance.

Abstract

This study investigates the relative impact of training data quality versus quantity on the performance of small language models (SLMs), using the TinyStories dataset for empirical analysis. Dataset variations were analyzed with respect to size (25% and 50% of the original size) and duplication (controlled rates of 25%, 50%, 75%, and 100%). Model performance was evaluated using validation loss, accuracy, and perplexity. Results indicate that training data quality plays a more significant role in the overall performance of SLMs, especially given the scale of this experiment. Minimal duplication positively impacted model accuracy (+0.87% at 25% duplication) without significantly increasing perplexity (+0.52% going from 0% to 25% duplication), but excessive duplication led to pronounced performance degradation (-40% drop in accuracy at…
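The dataset variations and the perplexity metric described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `subsample`, `add_duplicates`, and `perplexity` are hypothetical, and the reading of "duplication rate" (copies appended equal to that fraction of the subset) is an assumption, since the abstract does not define it. The only part that is standard is the relation between perplexity and mean cross-entropy loss.

```python
import math
import random


def subsample(stories, fraction, seed=0):
    """Return a random subset holding `fraction` of the stories (e.g. 0.25 or 0.5)."""
    rng = random.Random(seed)
    k = int(len(stories) * fraction)
    return rng.sample(stories, k)


def add_duplicates(stories, dup_rate, seed=0):
    """Append randomly chosen copies of existing stories, then shuffle.

    ASSUMPTION: a rate of 0.25 means the number of appended copies equals
    25% of the input size. The paper may define the rate differently.
    """
    rng = random.Random(seed)
    n_dup = int(len(stories) * dup_rate)
    duplicates = [rng.choice(stories) for _ in range(n_dup)]
    combined = list(stories) + duplicates
    rng.shuffle(combined)
    return combined


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)


# Hypothetical stand-in for the TinyStories training split.
stories = [f"story {i}" for i in range(100)]
half = subsample(stories, 0.5)          # 50% size variant
half_dup25 = add_duplicates(half, 0.25)  # same variant with 25% duplication
```

Under this reading, duplication adds training examples without adding any new information, which is why it serves as a proxy for lower data quality at a fixed amount of unique content.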

Code & Models

Repositories

aryan-sajith/urv-data_quantity_vs_data_quality-research
PyTorch · Official
