EMMeTT: Efficient Multimodal Machine Translation Training
Piotr Żelasko, Zhehuai Chen, Mengru Wang, Daniel Galvez, Oleksii Hrinchuk, Shuoyang Ding, Ke Hu, Jagadeesh Balam, Vitaly Lavrukhin, Boris Ginsburg

TL;DR
This paper introduces EMMeTT, a novel training framework that enhances multimodal neural machine translation by efficiently integrating speech and text data, leading to improved translation performance across multiple languages.
Contribution
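The paper presents EMMeTT, a training framework that makes joint speech-and-text training of NMT models more efficient through balanced sampling across languages, datasets, and modalities; efficient sequential data iteration; and a 2D bucketing scheme paired with a batch size optimizer (OOMptimizer).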
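As an illustration of what balanced sampling across data sources can look like, the sketch below draws each training example from one of several sources according to a fixed weight. It is not taken from the paper: the function name `balanced_mux`, the weights, and the policy of dropping exhausted sources are all assumptions made for this example.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple


def balanced_mux(
    sources: Dict[str, Iterable],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Yield (source_name, example) pairs, choosing the source by weight.

    Hypothetical helper: each source could stand for one
    (language, dataset, modality) combination.
    """
    rng = random.Random(seed)
    iterators = {name: iter(src) for name, src in sources.items()}
    while iterators:
        names = list(iterators)
        # Pick the next source in proportion to its weight.
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(iterators[name])
        except StopIteration:
            # Assumed policy: an exhausted source simply drops out.
            del iterators[name]


if __name__ == "__main__":
    toy_sources = {"text_en_de": range(3), "speech_en_de": range(2)}
    toy_weights = {"text_en_de": 0.5, "speech_en_de": 0.5}
    for source_name, example in balanced_mux(toy_sources, toy_weights):
        print(source_name, example)
```

Each source is read sequentially, so examples keep their original order within a source while the mix between sources follows the weights.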
Findings
Multimodal training improves translation quality for both architectures studied (decoder-only GPT and encoder-decoder T5).
SALM-T5 retains NMT capabilities while outperforming AST baselines.
The framework achieves strong text and speech translation results.
Abstract
A rising interest in the modality extension of foundation language models warrants discussion on the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only GPT and encoder-decoder T5, extended with Canary-1B's speech encoder. To handle joint multimodal training, we propose a novel training framework called EMMeTT. EMMeTT improves training efficiency with the following: balanced sampling across languages, datasets, and modalities; efficient sequential data iteration; and a novel 2D bucketing scheme for multimodal data, complemented by a batch size optimizer (OOMptimizer). We show that multimodal training consistently helps with both architectures.…
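The abstract names a 2D bucketing scheme without describing it here, so the following is only a rough sketch of the general idea. It assumes the two dimensions are input length and output length, that bucket edges are upper bounds, and that each bucket has its own batch size. The helper names, the edges, and the batch sizes are hypothetical; the batch size optimizer (OOMptimizer) is not modeled, and hand-picked placeholders stand in for whatever values it would produce.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Example = Tuple[int, int]  # (input_length, output_length), an assumed layout
BucketKey = Tuple[int, int]


def assign_bucket(
    in_len: int, out_len: int, in_edges: List[int], out_edges: List[int]
) -> Optional[BucketKey]:
    """Return the (row, column) bucket whose upper bounds fit both lengths."""
    i = bisect_left(in_edges, in_len)
    j = bisect_left(out_edges, out_len)
    if i == len(in_edges) or j == len(out_edges):
        return None  # longer than the largest bucket in some dimension
    return i, j


def bucket_batches(
    examples: Iterable[Example],
    in_edges: List[int],
    out_edges: List[int],
    batch_sizes: Dict[BucketKey, int],
) -> Iterator[Tuple[BucketKey, List[Example]]]:
    """Group examples by 2D bucket and emit a batch when a bucket fills up."""
    pending: Dict[BucketKey, List[Example]] = defaultdict(list)
    for ex in examples:
        key = assign_bucket(ex[0], ex[1], in_edges, out_edges)
        if key is None:
            continue  # assumed policy: skip examples that fit no bucket
        pending[key].append(ex)
        if len(pending[key]) == batch_sizes.get(key, 1):
            yield key, pending.pop(key)
    # Flush partially filled buckets at the end of the data.
    for key, rest in pending.items():
        if rest:
            yield key, rest


if __name__ == "__main__":
    data = [(3, 2), (4, 1), (12, 8), (30, 1), (5, 2)]
    sizes = {(0, 0): 2, (1, 1): 1}  # placeholder per-bucket batch sizes
    for key, batch in bucket_batches(data, [5, 20], [4, 10], sizes):
        print(key, batch)
```

Bucketing on two lengths keeps examples of similar shape together, so short pairs can use larger batches than long ones while limiting padding in both dimensions.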
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Adafactor · Gated Linear Unit · Linear Warmup With Cosine Annealing · Adam · SentencePiece
