Multimodal Attention for Neural Machine Translation
Ozan Caglayan, Loïc Barrault, Fethi Bougares

TL;DR
This paper explores a multimodal attention mechanism that attends jointly to an image and its source-language description to improve neural machine translation, reporting gains of up to 1.6 BLEU and METEOR points over a text-only baseline.
Contribution
It introduces a novel multimodal attention mechanism for NMT that leverages both visual and textual information simultaneously.
Findings
Up to 1.6 points of improvement in BLEU and METEOR over a textual NMT baseline
Dedicated attention per modality enhances translation quality
Effective integration of image and text modalities in NMT
Abstract
The attention mechanism is an important part of neural machine translation (NMT), where it has been reported to produce richer source representations than fixed-length encoding sequence-to-sequence models. Recently, the effectiveness of attention has also been explored in the context of image captioning. In this work, we assess the feasibility of a multimodal attention mechanism that simultaneously focuses on an image and its natural language description to generate a description in another language. We train several variants of our proposed attention mechanism on the Multi30k multilingual image captioning dataset. We show that a dedicated attention for each modality achieves gains of up to 1.6 points in BLEU and METEOR compared to a textual NMT baseline.
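The idea of a dedicated attention for each modality can be illustrated with a minimal sketch. The abstract does not give the architecture, so everything below is an assumption made for illustration: the additive (Bahdanau-style) scoring function, the toy dimensions, the random parameters, the helper names (`additive_attention`, `make_params`, `multimodal_context`), and the fusion of the two context vectors by concatenation. It is not the authors' implementation, only one plausible reading of "separate attention over the text and over the image".

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def additive_attention(annotations, state, W_a, W_s, v):
    # Assumed scoring function: score_i = v . tanh(W_a a_i + W_s s).
    proj_state = matvec(W_s, state)
    scores = []
    for a in annotations:
        hidden = [math.tanh(p + q) for p, q in zip(matvec(W_a, a), proj_state)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(annotations[0])
    # Context vector: attention-weighted sum of the annotations.
    context = [sum(w * a[d] for w, a in zip(weights, annotations))
               for d in range(dim)]
    return context, weights


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_params(ann_dim, state_dim, att_dim, rng):
    # One independent parameter set per modality (hypothetical helper).
    return {
        "W_a": rand_matrix(att_dim, ann_dim, rng),
        "W_s": rand_matrix(att_dim, state_dim, rng),
        "v": [rng.uniform(-0.1, 0.1) for _ in range(att_dim)],
    }


def multimodal_context(text_ann, image_ann, state, text_params, image_params):
    # Each modality is attended with its own parameters, conditioned on the
    # same decoder state. Concatenation as the fusion step is an assumption.
    c_txt, a_txt = additive_attention(text_ann, state, **text_params)
    c_img, a_img = additive_attention(image_ann, state, **image_params)
    return c_txt + c_img, a_txt, a_img


rng = random.Random(0)
text_ann = rand_matrix(5, 8, rng)    # 5 source words, 8-dim annotations (toy)
image_ann = rand_matrix(9, 6, rng)   # 9 image regions, 6-dim features (toy)
state = [rng.uniform(-1, 1) for _ in range(4)]  # toy decoder state

text_params = make_params(8, 4, 7, rng)
image_params = make_params(6, 4, 7, rng)

context, a_txt, a_img = multimodal_context(
    text_ann, image_ann, state, text_params, image_params
)
```

Here `a_txt` and `a_img` are two separate probability distributions, one over source words and one over image regions, and `context` is the vector a decoder step would consume. Sharing a single parameter set across both modalities would be a different variant; the sketch shows only the dedicated case.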
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
