Towards Zero-Shot Multimodal Machine Translation
Matthieu Futeral, Cordelia Schmid, Benoît Sagot, Rachel Bawden

TL;DR
This paper introduces ZeroMMT, a zero-shot multimodal machine translation approach that adapts text-only models using visual context, enabling translation disambiguation without fully supervised multimodal data, and demonstrates competitive results on multiple benchmarks.
Contribution
The work presents a zero-shot method for multimodal translation that is trained on multimodal English data only, bypassing the need for costly fully supervised multimodal datasets.
Findings
Achieves near state-of-the-art disambiguation performance
Extends evaluation to Arabic, Russian, and Chinese
Allows control over translation and disambiguation trade-off
Abstract
Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We…
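The mixture of two objectives described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `kl_weight` parameter, the direction of the KL term, and the simple weighted sum are all assumptions made for illustration, and plain Python lists stand in for model output distributions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def masked_lm_loss(token_probs, target_ids, masked_positions, eps=1e-12):
    """Mean negative log-likelihood of the true tokens, at masked positions only."""
    losses = [
        -math.log(token_probs[i][target_ids[i]] + eps) for i in masked_positions
    ]
    return sum(losses) / len(losses)


def combined_loss(
    vmlm_probs,         # adapted model's predictions for the masked English input
    target_ids,         # true token ids of the English sentence
    masked_positions,   # positions that were masked out
    original_mt_probs,  # per-token output distributions of the frozen text-only MT model
    adapted_mt_probs,   # per-token output distributions of the adapted (image-aware) model
    kl_weight=1.0,      # hypothetical weighting, not taken from the paper
):
    """Weighted sum of a masked-LM term and a KL term (assumed combination)."""
    vmlm = masked_lm_loss(vmlm_probs, target_ids, masked_positions)
    # Assumed direction: KL(original || adapted), averaged over output positions.
    kl = sum(
        kl_divergence(p, q) for p, q in zip(original_mt_probs, adapted_mt_probs)
    ) / len(original_mt_probs)
    return vmlm + kl_weight * kl
```

In this sketch the masked-LM term rewards the adapted model for recovering masked English tokens (in the paper's setting, with help from the image), while the KL term penalises drift of its translation distribution away from the original text-only model. When the two translation distributions are identical, the KL term is zero and only the masked-LM term remains.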
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
