Supervised Visual Attention for Simultaneous Multimodal Machine Translation
Veneta Haralampieva, Ozan Caglayan, Lucia Specia

TL;DR
This paper introduces the first Transformer-based simultaneous multimodal machine translation model with supervised visual attention, improving translation quality by leveraging image context and phrase-region alignments.
Contribution
The paper presents the first Transformer-based architecture for simultaneous multimodal machine translation and extends it with supervised visual attention.
Findings
Supervised visual attention improves translation quality.
Fine-tuning with supervision outperforms training from scratch.
The approach achieves improvements of up to 2.3 BLEU and 3.5 METEOR.
Abstract
Recently, there has been a surge in research in multimodal machine translation (MMT), where additional modalities such as images are used to improve the translation quality of textual systems. A particular use for such multimodal systems is the task of simultaneous machine translation, where visual context has been shown to complement the partial information provided by the source sentence, especially in the early phases of translation. In this paper, we propose the first Transformer-based simultaneous MMT architecture. Additionally, we extend this model with an auxiliary supervision signal that guides its visual attention mechanism using labelled phrase-region alignments. We perform comprehensive experiments on three language directions and conduct thorough quantitative and qualitative analyses using both automatic metrics and manual…
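The auxiliary supervision signal mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: the target is taken to be a uniform distribution over the image regions labelled as aligned to a phrase, the attention loss is a cross-entropy against that target, and the names `alignment_target`, `attention_supervision_loss`, `total_loss`, and the `weight` parameter are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_target(aligned_regions, num_regions):
    """Assumed target: uniform mass over the regions labelled as aligned.

    Returns None when a phrase has no labelled region, so that no
    supervision is applied for it.
    """
    if not aligned_regions:
        return None
    mass = 1.0 / len(aligned_regions)
    return [mass if i in aligned_regions else 0.0 for i in range(num_regions)]


def attention_supervision_loss(attn_scores, aligned_regions, eps=1e-12):
    """Cross-entropy between the alignment target and the visual attention."""
    attn = softmax(attn_scores)
    target = alignment_target(aligned_regions, len(attn))
    if target is None:
        return 0.0
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))


def total_loss(translation_loss, attn_loss, weight=1.0):
    """Hypothetical combined objective: translation loss plus weighted auxiliary term."""
    return translation_loss + weight * attn_loss


# Attention that already focuses on the labelled region is penalised less
# than attention that focuses elsewhere.
focused = attention_supervision_loss([5.0, 0.0, 0.0], {0})
misplaced = attention_supervision_loss([5.0, 0.0, 0.0], {2})
print(focused < misplaced)
```

In this sketch the auxiliary term only adds a penalty when labelled alignments exist, so phrases without region annotations leave the translation loss unchanged. How the actual model weights or schedules the two terms is not stated in the summary above.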
