Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

Jiaoda Li; Duygu Ataman; Rico Sennrich

arXiv:2109.03415·cs.CL·September 9, 2021

TL;DR

This paper investigates whether multimodal machine translation (MMT) models actually use visual context. It argues that current datasets and evaluation methods may not encourage models to leverage visual signals, which impedes progress.

Contribution

The study highlights dataset limitations in stimulating visual modality use and proposes methods to improve dataset design for better evaluation of visual influence in MMT.

Findings

01

Current datasets do not effectively promote visual context utilization.

02

Models show limited reliance on images in existing benchmarks.

03

The paper proposes methods for highlighting the importance of visual signals in datasets, so that models are encouraged to leverage them.

Abstract

Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. However, recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise, which suggests that the visual context might not be exploited by the model at all. We hypothesize that this might be caused by the nature of the commonly used evaluation benchmark, also known as Multi30K, where the translations of image captions were prepared without actually showing the images to human translators. In this paper, we present a qualitative study that examines the role of datasets in stimulating the leverage of visual modality and we propose methods to highlight the importance of visual signals in the datasets which demonstrate…
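The image-replacement check mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: `translate`, `score`, `derange`, and `incongruence_gap` are hypothetical names, and the per-sentence score stands in for whatever metric an evaluation actually uses. Each source sentence is translated once with its own image and once with an image taken from a different sample; a gap near zero suggests the model is ignoring the visual input.

```python
import random
from typing import Any, Callable, List, Sequence


def derange(n: int, rng: random.Random) -> List[int]:
    """Return a permutation of range(n) in which no index maps to itself."""
    if n < 2:
        raise ValueError("need at least two samples to mismatch images")
    shift = rng.randrange(1, n)  # non-zero cyclic shift, so i never maps to i
    return [(i + shift) % n for i in range(n)]


def incongruence_gap(
    translate: Callable[[str, Any], str],   # hypothetical: (source, image) -> hypothesis
    score: Callable[[str, str], float],     # hypothetical: (hypothesis, reference) -> score
    sources: Sequence[str],
    images: Sequence[Any],
    references: Sequence[str],
    seed: int = 0,
) -> float:
    """Mean score with matching images minus mean score with mismatched images."""
    perm = derange(len(sources), random.Random(seed))

    # Congruent decoding: every sentence is paired with its own image.
    congruent = [translate(s, img) for s, img in zip(sources, images)]
    # Incongruent decoding: every sentence is paired with another sample's image.
    incongruent = [translate(s, images[j]) for s, j in zip(sources, perm)]

    def mean_score(hypotheses: List[str]) -> float:
        return sum(score(h, r) for h, r in zip(hypotheses, references)) / len(references)

    return mean_score(congruent) - mean_score(incongruent)


if __name__ == "__main__":
    # Toy stand-ins: one "model" ignores the image, the other depends on it.
    srcs, imgs = ["a", "b", "c"], ["x", "y", "z"]
    exact = lambda hyp, ref: 1.0 if hyp == ref else 0.0
    text_only = lambda s, img: s
    image_aware = lambda s, img: s + "|" + img
    print(incongruence_gap(text_only, exact, srcs, imgs, ["a", "b", "c"]))
    print(incongruence_gap(image_aware, exact, srcs, imgs, ["a|x", "b|y", "c|z"]))
```

A cyclic shift is used instead of a plain shuffle so that no sentence accidentally keeps its own image. With the toy models, the text-only stand-in shows no gap, while the image-dependent one loses all of its exact matches once the images are mismatched.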

Code & Models

Repositories

jiaodali/vision-matters-when-it-should (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling