Distilling Translations with Visual Awareness
Julia Ive, Pranava Madhyastha, Lucia Specia

TL;DR
This paper introduces a translate-and-refine method for multimodal machine translation in which only a second-stage decoder uses the image. The method reaches state-of-the-art results and can recover from erroneous or missing source words.
Contribution
It presents a jointly trained two-stage approach: a first decoder produces a draft translation, and a second decoder refines that draft using the full target-side context (left and right) and the image.
Findings
Achieves state-of-the-art translation performance.
Improves handling of ambiguous words with visual context.
Recovers from erroneous or missing source words.
Abstract
Previous work on multimodal machine translation has shown that visual information is only needed in very specific cases, for example in the presence of ambiguous words where the textual context is not sufficient. As a consequence, models tend to learn to ignore this information. We propose a translate-and-refine approach to this problem where images are only used by a second stage decoder. This approach is trained jointly to generate a good first draft translation and to improve over this draft by (i) making better use of the target language textual context (both left and right-side contexts) and (ii) making use of visual context. This approach leads to state-of-the-art results. Additionally, we show that it has the ability to recover from erroneous or missing words in the source language.
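To make the translate-and-refine data flow concrete, here is a minimal toy sketch in Python. It is not the authors' model: the lexicon, the image "concept" tags, the overlap score, and the names `first_pass` and `second_pass` are all hypothetical stand-ins for learned components. It only illustrates that the first pass sees text alone, while the second pass revises the draft using image information. The real second-stage decoder also attends to the left and right context of the draft, which this toy does not model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon (an assumption for illustration, not from the paper).
# Each source word maps to candidate (target word, visual concepts) pairs;
# the first candidate is the text-only default.
LEXICON: Dict[str, List[Tuple[str, Set[str]]]] = {
    "man": [("Mann", {"person"})],
    "swings": [("schwingt", set())],
    "bat": [
        ("Fledermaus", {"animal", "wings"}),   # the animal sense
        ("Schläger", {"sports", "baseball"}),  # the sports-equipment sense
    ],
}
UNK = "<unk>"


def first_pass(source: List[str]) -> List[str]:
    """Text-only draft: default candidate per word, UNK if the word is unknown."""
    return [LEXICON[w][0][0] if w in LEXICON else UNK for w in source]


def second_pass(source: List[str], draft: List[str],
                image_concepts: Set[str]) -> List[str]:
    """Refine the draft word by word using the (toy) image concepts."""
    refined = []
    for src_word, draft_word in zip(source, draft):
        if src_word in LEXICON:
            candidates = LEXICON[src_word]
        else:
            # Source word is missing or corrupted: consider every target word.
            candidates = [c for cands in LEXICON.values() for c in cands]
        best_word, best_score = draft_word, 0
        for word, concepts in candidates:
            score = len(concepts & image_concepts)  # overlap with the image
            if score > best_score:
                best_word, best_score = word, score
        refined.append(best_word)  # keeps the draft word if nothing scores higher
    return refined


image = {"person", "sports", "baseball"}  # hypothetical tags for one image

source = ["man", "swings", "bat"]
draft = first_pass(source)
refined = second_pass(source, draft, image)

# Same sentence with the last source word masked out.
masked = ["man", "swings", "<blank>"]
masked_draft = first_pass(masked)
masked_refined = second_pass(masked, masked_draft, image)
```

In this toy run the text-only draft picks the default sense of "bat", and the second pass switches to the sports sense because it overlaps with the image tags. With the source word masked, the draft contains `<unk>` and the second pass fills it from the image tags alone, loosely mirroring the recovery behaviour described in the abstract.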
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
