Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining
Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini

TL;DR
This paper shows that selective data training, which prioritizes quality over quantity, can significantly improve language model adaptation: a model trained on a small curated subset outperforms one trained on the full corpus.
Contribution
Introduces Curió-Edu 7B, a Portuguese language model trained on a curated educational and STEM-filtered subset of the corpus, showing that data quality matters for linguistic adaptation even with less data and computation.
Findings
Curió-Edu 7B outperforms the full-corpus model in evaluations.
Selective data training can be more effective than larger-scale training.
A limited but quality-focused dataset can achieve better adaptation results.
Abstract
Continued pretraining extends a language model's capabilities by further exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curió 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus - the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curió-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens.…
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
