On Vision Features in Multimodal Machine Translation

Bei Li; Chuanhao Lv; Zefan Zhou; Tao Zhou; Tong Xiao; Anxiang Ma; JingBo Zhu

arXiv:2203.09173·cs.CL·March 18, 2022

TL;DR

This paper investigates how the quality and type of vision model influence multimodal machine translation (MMT). It finds that stronger vision models improve translation, and it cautions that MMT models need careful evaluation because current benchmarks are small-scale and biased.

Contribution

The paper systematically examines how a range of advanced vision models affect MMT and introduces a selective attention model for analyzing the contribution of an image at the patch level.

Findings

01. Stronger vision models enhance translation quality in MMT.

02. Visual features from advanced models contribute significantly to translation.

03. Current benchmarks may be biased and insufficient for comprehensive evaluation.

Abstract

Previous work on multimodal machine translation (MMT) has focused on how to incorporate vision features into translation, but little attention has been paid to the quality of the vision models themselves. In this work, we investigate the impact of vision models on MMT. Given that the Transformer is becoming popular in computer vision, we experiment with various strong models (such as Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the patch-level contribution of an image in MMT. On detailed probing tasks, we find that stronger vision models are helpful for learning translation from the visual modality. Our results also suggest the need to examine MMT models carefully, especially when current benchmarks are small-scale and biased. Our code is available at https://github.com/libeineu/fairseq_mmt.
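The abstract does not spell out the selective attention model, so the following is only a minimal NumPy sketch of the general idea it suggests: each source token attends over image patch features, and the resulting weights indicate how much each patch contributes. The function names, the single-head formulation, the sigmoid gate used to merge text and image representations, and all dimensions are assumptions made for illustration. They are not taken from the authors' fairseq_mmt code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def selective_attention(h_text, h_img, w_q, w_k, w_v):
    """Single-head cross-attention from text tokens to image patches.

    h_text: (n_tokens, d) text encoder states (queries).
    h_img:  (n_patches, d) patch features from a vision model (keys/values).
    Returns the attended image representation and the attention weights.
    """
    q = h_text @ w_q
    k = h_img @ w_k
    v = h_img @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # (n_tokens, n_patches)
    return weights @ v, weights

def gated_fusion(h_text, h_attn, u, v):
    """Hypothetical sigmoid gate that mixes text and attended image states."""
    gate = 1.0 / (1.0 + np.exp(-(h_text @ u + h_attn @ v)))
    return (1.0 - gate) * h_text + gate * h_attn

# Toy inputs: 5 source tokens, 49 patches (a 7x7 grid), hidden size 16.
rng = np.random.default_rng(0)
n_tokens, n_patches, d = 5, 49, 16
h_text = rng.normal(size=(n_tokens, d))
h_img = rng.normal(size=(n_patches, d))
w_q, w_k, w_v, u, v = (rng.normal(size=(d, d)) * 0.1 for _ in range(5))

h_attn, weights = selective_attention(h_text, h_img, w_q, w_k, w_v)
fused = gated_fusion(h_text, h_attn, u, v)

# For each token, the index of the patch that received the most attention.
top_patch = weights.argmax(axis=1)
```

Each row of `weights` is a distribution over patches for one source token, which is what makes a patch-level analysis possible: reshaping a row to the patch grid (7x7 in this toy setup) gives a heat map over the image for that token.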

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Translation Studies and Practices

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Dropout