Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining
Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini

TL;DR
Synthetic rewriting improves Portuguese continued pretraining most when the source data is already high quality, and the effect grows with model scale: rewriting acts as a quality multiplier, not as a substitute for data curation.
Contribution
The paper provides a controlled analysis of how synthetic rewriting interacts with source data quality in Portuguese continued pretraining, highlighting scale-dependent effects.
Findings
Rewriting high-quality data improves model performance by +3.4 NPM at 7B scale.
Rewriting low-quality data yields minimal gains (+0.5 NPM) at 7B scale.
The quality multiplier effect is more pronounced at larger model scales.
Abstract
Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data…
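The experimental grid described in the abstract can be sketched as a small enumeration. This is only an illustration of the stated numbers (two 10B-token subsets, four rewriting styles, two base models). The subset labels and style names below are hypothetical placeholders, since the abstract does not name them, and the token estimate assumes each rewrite is roughly as long as its source document.

```python
from itertools import product

# Values stated in the abstract.
SUBSET_TOKENS = 10e9          # each quality subset holds 10B tokens
NUM_STYLES = 4                # each subset is rewritten into four styles
MODEL_SIZES = ["1.1B", "7B"]  # English-centric base models

# Hypothetical labels: the abstract does not name the subsets or the styles.
QUALITY_LEVELS = ["high_quality", "low_quality"]
STYLES = [f"style_{i}" for i in range(1, NUM_STYLES + 1)]


def synthetic_tokens(subset_tokens: float, num_styles: int) -> float:
    """Approximate synthetic tokens, assuming each rewrite keeps the source length."""
    return subset_tokens * num_styles


def rewrite_conditions():
    """All (model size, source quality) pairs trained on rewritten data."""
    return list(product(MODEL_SIZES, QUALITY_LEVELS))


if __name__ == "__main__":
    for model, quality in rewrite_conditions():
        tokens = synthetic_tokens(SUBSET_TOKENS, NUM_STYLES)
        print(f"{model} on rewritten {quality}: ~{tokens / 1e9:.0f}B synthetic tokens")
```

Under the equal-length assumption, four rewrites of a 10B-token subset give the roughly 40B synthetic tokens per condition that the abstract reports.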
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
