Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

Tianjian Li; Haoran Xu; Philipp Koehn; Daniel Khashabi; Kenton Murray

arXiv:2310.00840 · cs.CL · March 20, 2024

TL;DR

This paper introduces Error Norm Truncation (ENT), a modification to the standard training objective that truncates noisy training data, making text generation models trained on noisy web-crawled text more robust and improving their generation quality.

Contribution

We propose ENT, a truncation technique that estimates data quality more accurately by considering the distribution of non-target tokens rather than the negative log-likelihood loss alone, improving robustness and generation quality when training on noisy data.

Findings

01. ENT improves generation quality over standard training.

02. ENT increases robustness against data noise, boosting BLEU scores.

03. Gains hold across language modeling, machine translation, and text summarization.

Abstract

Text generation models are notoriously vulnerable to errors in the training data. As massive amounts of web-crawled data become widely available, how can we enhance the robustness of models trained on such noisy text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement to the standard training objective that truncates noisy data. Compared to methods that use only the negative log-likelihood loss to estimate data quality, our method provides a more accurate estimate by considering the distribution of non-target tokens, which is often overlooked by previous work. Through comprehensive experiments across language modeling, machine translation, and text summarization, we show that equipping text generation models with ENT improves generation quality over standard training and previous soft and…
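To illustrate why the distribution of non-target tokens can matter, the sketch below compares two hypothetical predictions that assign the same probability to the target token and therefore have the same negative log-likelihood. It assumes the error norm is the L2 distance between the predicted distribution and the one-hot target, which the abstract above does not spell out. The function names, the toy distributions, and the threshold value are illustrative assumptions, not the authors' implementation.

```python
import math


def nll(probs, target):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target])


def error_norm(probs, target):
    """L2 distance between the predicted distribution and the one-hot
    target (assumed definition, see the note above)."""
    return math.sqrt(
        sum((p - (1.0 if i == target else 0.0)) ** 2
            for i, p in enumerate(probs))
    )


def truncated_loss(token_probs, targets, threshold):
    """Mean NLL over tokens whose error norm is at most `threshold`;
    tokens above it are dropped from the loss (hypothetical hard cutoff)."""
    kept = [
        nll(p, t)
        for p, t in zip(token_probs, targets)
        if error_norm(p, t) <= threshold
    ]
    return sum(kept) / len(kept) if kept else 0.0


# Both predictions give the target (index 0) probability 0.4,
# so their NLL is identical.
peaked = [0.4, 0.6, 0.0, 0.0]  # remaining mass on one wrong token
spread = [0.4, 0.2, 0.2, 0.2]  # remaining mass spread over wrong tokens

same_nll = math.isclose(nll(peaked, 0), nll(spread, 0))
peaked_norm = error_norm(peaked, 0)
spread_norm = error_norm(spread, 0)

# With a cutoff between the two norms, only `spread` contributes.
loss = truncated_loss([peaked, spread], [0, 0], threshold=0.8)
```

In this toy case the loss alone cannot tell the two predictions apart, while the error norm is larger when the model confidently prefers a single non-target token. That is the kind of signal a truncation rule based only on negative log-likelihood would miss.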

Videos

Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification