Imagination improves Multimodal Translation
Desmond Elliott, Ákos Kádár

TL;DR
This paper introduces a multitask learning approach to multimodal translation that trains a translation model jointly with a visual grounding task, improving translation performance on the Multi30K dataset.
Contribution
It presents a multitask framework that combines an attention-based encoder-decoder for translation with image representation prediction for visual grounding, and shows that each task can be trained on an external dataset.
Findings
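Improved translation performance over the state of the art on the Multi30K dataset
Image prediction is equally effective when trained on the external MS COCO dataset
Improved translation performance when the translation model is trained on the external News Commentary parallel text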
Abstract
We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations. In a multitask learning framework, translations are learned in an attention-based encoder-decoder, and grounded representations are learned through image representation prediction. Our approach improves translation performance compared to the state of the art on the Multi30K dataset. Furthermore, it is equally effective if we train the image prediction task on the external MS COCO dataset, and we find improvements if we train the translation model on the external News Commentary parallel text.
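The two-task decomposition described in the abstract can be illustrated with a small sketch. It shows one plausible way to score a predicted image representation against the true one and to combine that score with a translation loss. The abstract does not specify these details, so the mean pooling, the linear projection, the cosine-based margin loss, the `margin` value, and the interpolation `weight` are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(states):
    """Average a list of encoder hidden states into one sentence vector (assumed pooling)."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def project(vec, weights):
    """Linear map into image-feature space; `weights` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def image_prediction_loss(predicted, target, contrastive, margin=0.1):
    """Assumed margin loss: the true image vector should be closer to the
    prediction than each contrastive image vector by at least `margin`."""
    pos = cosine(predicted, target)
    return sum(max(0.0, margin - pos + cosine(predicted, neg)) for neg in contrastive)


def multitask_loss(translation_loss, grounding_loss, weight=0.5):
    """Assumed combination: linear interpolation of the two task losses."""
    return weight * translation_loss + (1.0 - weight) * grounding_loss


# Toy usage: two 2-dimensional encoder states and an identity projection.
states = [[1.0, 0.0], [0.0, 1.0]]
predicted = project(mean_pool(states), [[1.0, 0.0], [0.0, 1.0]])
grounding = image_prediction_loss(predicted, [1.0, 1.0], [[1.0, -1.0]])
total = multitask_loss(translation_loss=2.0, grounding_loss=grounding)
```

Because the two losses are only combined at the end, the sketch also reflects why each task could in principle draw on a different dataset: the grounding term needs sentence and image pairs, while the translation term needs parallel text.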
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
