SOTASTREAM: A Streaming Approach to Machine Translation Training

Matt Post; Thamme Gowda; Roman Grundkiewicz; Huda Khayrallah; Rohit Jain; Marcin Junczys-Dowmunt

arXiv:2308.07489 · cs.CL · August 16, 2023

TL;DR

SOTASTREAM introduces a streaming data approach for machine translation training that eliminates static preprocessing, enabling flexible, efficient, and on-the-fly data manipulation without sacrificing model accuracy.

Contribution

It presents a novel streaming data pipeline for MT training that simplifies data handling and enhances flexibility compared to traditional static preprocessing methods.

Findings

1. Reduces training time significantly.
2. Decreases disk space requirements.
3. Maintains model accuracy with flexible data manipulation.

Abstract

Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream…
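The separation of data generation from data consumption described in the abstract can be illustrated with a minimal Python sketch. This is not the sotastream API: the names `stream_permutations` and `batches` are hypothetical, the toy sentence pairs are invented for illustration, and tensorization is left out so that only the streaming and batching steps are shown.

```python
import itertools
import random
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def stream_permutations(pairs: List[Pair], seed: int = 0) -> Iterator[Pair]:
    """Hypothetical generator: yield an endless stream of examples,
    reshuffling the raw data on every pass."""
    rng = random.Random(seed)
    while True:
        epoch = list(pairs)   # copy so the caller's list is never mutated
        rng.shuffle(epoch)    # a fresh permutation for each pass
        yield from epoch


def batches(stream: Iterable[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Hypothetical consumer: group the stream into batches as it is read."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:         # only reached if the stream is finite
            return
        yield batch


if __name__ == "__main__":
    # Toy data, invented for this example.
    raw = [("ein Haus", "a house"), ("ein Hund", "a dog"), ("eine Katze", "a cat")]
    consumer = batches(stream_permutations(raw, seed=1), batch_size=2)
    for _ in range(3):
        print(next(consumer))
```

Because the generator never terminates, the consumer decides when to stop, for example after a fixed number of updates. Any on-the-fly manipulation, such as subword sampling, would sit between the two functions in this sketch rather than in a separate preprocessing step.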

Code & Models

Repositories

marian-nmt/sotastream (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis