SOTASTREAM: A Streaming Approach to Machine Translation Training
Matt Post, Thamme Gowda, Roman Grundkiewicz, Huda Khayrallah, Rohit Jain, and Marcin Junczys-Dowmunt

TL;DR
SOTASTREAM introduces a streaming data approach for machine translation training that eliminates static preprocessing, enabling flexible, efficient, and on-the-fly data manipulation without sacrificing model accuracy.
Contribution
It presents a novel streaming data pipeline for MT training that simplifies data handling and enhances flexibility compared to traditional static preprocessing methods.
Findings
Reduces training time significantly.
Decreases disk space requirements.
Maintains model accuracy with flexible data manipulation.
Abstract
Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream…
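The separation of data generation from data consumption described in the abstract can be sketched as follows. This is a minimal illustration, not SOTASTREAM's actual API: the names `stream_permutations`, `batches`, and the `transform` hook are hypothetical, and the "tensorization" step is reduced to grouping sentence pairs into lists.

```python
import itertools
import random
from typing import Callable, Iterator, List, Optional, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def stream_permutations(
    pairs: List[Pair],
    seed: int = 0,
    transform: Optional[Callable[[Pair], Pair]] = None,  # hypothetical on-the-fly hook
) -> Iterator[Pair]:
    """Generator side: yield an endless stream in which each pass over the
    raw data is a fresh random permutation. Nothing is written to disk."""
    if not pairs:
        return  # an empty corpus would otherwise loop forever without yielding
    rng = random.Random(seed)
    while True:
        order = list(range(len(pairs)))
        rng.shuffle(order)
        for i in order:
            pair = pairs[i]
            yield transform(pair) if transform else pair


def batches(stream: Iterator[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Consumer side: group the stream into batches as it is read.
    A real trainer would tokenize and tensorize each batch here."""
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    raw = [("ein Haus", "a house"), ("Hallo", "hello"), ("gute Nacht", "good night")]
    lowercase = lambda p: (p[0].lower(), p[1].lower())
    for batch in itertools.islice(batches(stream_permutations(raw, transform=lowercase), 2), 3):
        print(batch)
```

Because the generator never terminates, the consumer decides when to stop (here, after three batches). Changing the `transform` function changes the data seen by the trainer without re-running any preprocessing step.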
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis
