Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining

Thales Sales Almeida; Rodrigo Nogueira; Hélio Pedrini

arXiv:2512.12770·cs.CL·December 16, 2025

TL;DR

This paper shows that selective data training, which favors quality over quantity, can significantly improve language model adaptation: a model trained on a small curated subset outperforms a same-size model trained on the full corpus.

Contribution

Introduces Curió-Edu 7B, a Portuguese language model trained only on the 10-billion-token educational and STEM-filtered subset of the ClassiCC-PT corpus, showing that data quality matters for linguistic adaptation even with less data and computation.

Findings

01

Curió-Edu 7B outperforms the full-corpus model in evaluations.

02

Selective data training can be more effective than larger-scale training.

03

A limited but quality-focused dataset can yield better adaptation results.

Abstract

Continued pretraining extends a language model's capabilities by further exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curió 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus - the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curió-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens.…

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education