Does Multimodality Help Human and Machine for Translation and Image Captioning?
Ozan Caglayan, Walid Aransa, Yaxing Wang, Marc Masana, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Joost van de Weijer

TL;DR
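This paper describes the LIUM and CVC systems for the WMT16 Multimodal Machine Translation challenge. It compares phrase-based systems and attentional recurrent neural network models trained on monomodal or multimodal data, and adds a human evaluation of how useful multimodal data is for translation and image description generation. The systems obtained the best BLEU and METEOR scores on both tasks.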
Contribution
It builds and compares phrase-based and attentional neural systems trained on monomodal or multimodal data for multimodal translation and image description generation, evaluating them with automatic metrics (BLEU, METEOR) and a human evaluation.
Findings
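The submitted systems obtained the best BLEU and METEOR scores for both tasks in the challenge.
The compared approaches were phrase-based systems and attentional recurrent neural network models, trained on either monomodal or multimodal data.
A human evaluation was carried out to estimate how useful multimodal data is for translation and image description generation.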
Abstract
This paper presents the systems developed by LIUM and CVC for the WMT16 Multimodal Machine Translation challenge. We explored various comparative methods, namely phrase-based systems and attentional recurrent neural networks models trained using monomodal or multimodal data. We also performed a human evaluation in order to estimate the usefulness of multimodal data for human machine translation and image description generation. Our systems obtained the best results for both tasks according to the automatic evaluation metrics BLEU and METEOR.
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis
