Towards Zero-Shot Multimodal Machine Translation

Matthieu Futeral; Cordelia Schmid; Benoît Sagot; Rachel Bawden

arXiv:2407.13579 · cs.CL · March 12, 2025 · 1 citation

TL;DR

This paper introduces ZeroMMT, a zero-shot multimodal machine translation approach that adapts text-only models using visual context, enabling translation disambiguation without fully supervised multimodal data, and demonstrates competitive results on multiple benchmarks.

Contribution

The work presents a zero-shot method for multimodal translation that is trained on multimodal English data only, bypassing the need for costly fully supervised multimodal datasets (sentences paired with both translations and images).

Findings

01

Achieves near state-of-the-art disambiguation performance

02

Extends evaluation to Arabic, Russian, and Chinese

03

Allows control over translation and disambiguation trade-off

Abstract

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original and new MMT outputs. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We…
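
As a rough illustration of the two-objective mixture described in the abstract, the sketch below combines a masked language modelling cross-entropy with a KL term between the original and adapted output distributions. It is not the authors' implementation: all function names, the `kl_weight` coefficient, the KL direction (original relative to adapted), and the averaging over positions are assumptions made for illustration, and plain lists of logits stand in for real model outputs.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_lm_loss(logits_per_position, target_ids, masked_positions):
    """Mean cross-entropy over the masked positions only.

    In the visually conditioned setting, these logits would come from a
    model that also sees image features (assumed, not modelled here).
    """
    losses = []
    for pos in masked_positions:
        probs = softmax(logits_per_position[pos])
        losses.append(-math.log(probs[target_ids[pos]]))
    return sum(losses) / len(losses)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_kl(original_logits, adapted_logits):
    """Average KL between original and adapted outputs, position by position."""
    kls = [
        kl_divergence(softmax(o), softmax(a))
        for o, a in zip(original_logits, adapted_logits)
    ]
    return sum(kls) / len(kls)


def combined_loss(vmlm_logits, target_ids, masked_positions,
                  original_logits, adapted_logits, kl_weight=1.0):
    """Hypothetical mixture: masked LM loss plus a weighted KL penalty."""
    mlm = masked_lm_loss(vmlm_logits, target_ids, masked_positions)
    kl = mean_kl(original_logits, adapted_logits)
    return mlm + kl_weight * kl
```

In this sketch the KL term is zero when the adapted model reproduces the original model's output distribution, so it acts as a penalty on drifting away from the text-only translation behaviour while the masked objective encourages use of the visual context.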

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling