Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis
Sabri Boughorbel, MD Rizwan Parvez, Majd Hawasly

TL;DR
This paper explores improving language models for low-resource languages that are trained on translated data, combining continual pre-training with dictionary learning analysis to address translation quality and bias issues.
Contribution
The paper introduces a method that uses high-quality synthetic data for continual pre-training, together with dictionary learning analysis, to improve language models trained on translated datasets and mitigate issues introduced by translation.
Findings
Improved model performance after continual pre-training with synthetic data
Reduction in cultural and linguistic biases in models
Dictionary learning analysis helps interpret model improvements
Abstract
Training LLMs for low-resource languages usually relies on data augmentation from English using machine translation (MT). This, however, brings a number of challenges to LLM training: translating and curating huge amounts of content with high-end machine translation solutions is costly; the translated content carries over cultural biases; and if the translation is not faithful and accurate, data quality degrades, causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4-year-old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models of size 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify…
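The translation step described in the abstract can be pictured with a minimal sketch. It is not the authors' pipeline: the names `batched`, `translate_corpus`, and `fake_translate` are hypothetical, and the MT model is abstracted as a callable so the example runs without downloading anything. In practice, `translate_batch` would presumably wrap an MT model such as NLLB.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def translate_corpus(
    stories: Iterable[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> Iterator[Dict[str, str]]:
    """Translate stories batch by batch, keeping each source/target pair.

    `translate_batch` is an assumed interface: it takes a list of English
    texts and returns the same number of translated texts, in order.
    """
    for batch in batched(stories, batch_size):
        translated = translate_batch(batch)
        if len(translated) != len(batch):
            raise ValueError("translator returned a different number of texts")
        for source, target in zip(batch, translated):
            yield {"en": source, "ar": target}


def fake_translate(batch: List[str]) -> List[str]:
    """Stand-in translator for illustration only; it just tags the input."""
    return [f"<ar>{text}</ar>" for text in batch]


pairs = list(translate_corpus(["a", "b", "c"], fake_translate, batch_size=2))
```

Keeping the source text next to each translation makes it easier to inspect unfaithful translations later, which is the kind of data-quality issue the abstract raises.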
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Lexicography and Language Studies
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout
