Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis

Sabri Boughorbel; MD Rizwan Parvez; Majd Hawasly

arXiv:2405.14277·cs.CL·August 8, 2024

TL;DR

This paper explores how to improve language models for low-resource languages that are trained on translated data, combining continual pre-training with dictionary learning analysis to address translation quality and bias issues.

Contribution

The paper introduces a method that uses high-quality synthetic data and dictionary learning to improve language models trained on translated datasets, mitigating the pitfalls of translation.

Findings

01 Improved model performance after continual pre-training with synthetic data

02 Reduction in cultural and linguistic biases in models

03 Dictionary learning analysis helps interpret model improvements

Abstract

Training LLMs for low-resource languages usually relies on data augmentation from English using machine translation (MT). This, however, brings a number of challenges to LLM training: there are large costs attached to translating and curating huge amounts of content with high-end machine translation solutions; the translated content carries over cultural biases; and if the translation is not faithful and accurate, data quality degrades, causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4-year-old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models of size 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify…

Taxonomy

Topics: Natural Language Processing Techniques · Translation Studies and Practices · Lexicography and Language Studies

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout