Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning

Everlyn Asiko Chimoto; Jay Gala; Orevaoghene Ahia; Julia Kreutzer; Bruce A. Bassett; Sara Hooker

arXiv:2405.19462 · cs.CL · June 24, 2024

TL;DR

This paper introduces CAT, a data pruning method that uses early training dynamics to select relevant data points, significantly reducing training data size while maintaining translation quality.

Contribution

The paper presents a novel data pruning technique, Checkpoints Across Time (CAT), leveraging early training signals to improve efficiency in neural machine translation.

Findings

01

CAT outperforms existing pruning methods (COMET-QE, LASER and LaBSE) on Indo-European languages across multiple test sets.

02

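On English-German, English-French and English-Swahili, CAT prunes up to 50% of the training data while achieving performance comparable to training on the full dataset.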

03

Selected data tends to include longer, rarer sentences.

Abstract

Neural Machine Translation models are extremely data- and compute-hungry. However, not all data points contribute equally to model training and generalization. Pruning low-value data points can drastically reduce the compute budget without a significant drop in model performance. In this paper, we propose a new data pruning technique, Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the data points most relevant to model performance. We benchmark CAT against several data pruning techniques, including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves performance comparable to using the full dataset while pruning up to 50% of the training data. We…
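To make the idea of selecting data from early training dynamics concrete, here is a minimal sketch. The abstract does not state how CAT scores examples, so the scoring rule below is an assumption made for illustration: each training example is scored by the variance of its perplexity across a few early checkpoints, and the highest-scoring fraction is kept. The function name, the input layout and the choice to keep (rather than prune) high-variance examples are all hypothetical and should not be read as the authors' implementation.

```python
from statistics import pvariance


def select_by_checkpoint_variability(perplexities, keep_fraction=0.5):
    """Return indices of the examples to keep after pruning.

    perplexities[i] holds the perplexity of training example i measured at
    each of several early checkpoints (hypothetical input layout).
    ASSUMPTION: examples whose perplexity varies most across early
    checkpoints are treated as the most informative and are kept.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Score each example by the population variance of its perplexity
    # across the early checkpoints.
    scores = [pvariance(per_example) for per_example in perplexities]

    # Number of examples to retain, with a floor of one.
    n_keep = max(1, int(len(scores) * keep_fraction))

    # Rank example indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Return the kept indices in their original dataset order.
    return sorted(ranked[:n_keep])


# Toy data: four examples, three early checkpoints each (made-up numbers).
toy_perplexities = [
    [120.0, 80.0, 40.0],   # perplexity changes a lot early on
    [10.0, 10.5, 9.5],     # nearly flat
    [300.0, 150.0, 60.0],  # perplexity changes a lot early on
    [25.0, 24.0, 25.0],    # nearly flat
]
kept = select_by_checkpoint_variability(toy_perplexities, keep_fraction=0.5)
```

With `keep_fraction=0.5`, half of the toy examples are retained, mirroring the 50% pruning level mentioned in the abstract. In a real pipeline the perplexities would come from scoring the training set with checkpoints saved early in training, which is where the compute savings relative to a full training run would arise.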

Taxonomy

Topics: Statistics Education and Methodologies

Methods: Pruning