On Leveraging the Visual Modality for Neural Machine Translation
Vikas Raunak, Sang Keun Choe, Quanyang Lu, Yi Xu, Florian Metze

TL;DR
This paper investigates whether visual information helps neural machine translation on How2, a dataset larger and more complex than Multi30k. It proposes three new fusion methods but still finds only marginal gains, which it links to the quality of the visual embeddings.
Contribution
It introduces three novel fusion techniques for integrating visual context in NMT and analyzes the impact of visual embedding quality on translation performance.
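To make the idea of "fusion" concrete, here is a minimal sketch of one generic way to inject a visual vector into text representations: project the visual feature to the text dimension and add it to every token state. This is an illustrative assumption, not one of the paper's three techniques, which this summary does not specify. The names `project`, `additive_fusion`, `token_states`, `visual_feat`, and `weights` are hypothetical, and the numbers are toy values.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project(weights: Matrix, vec: Vector) -> Vector:
    """Multiply a (d_text x d_vis) weight matrix by a d_vis-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def additive_fusion(
    token_states: List[Vector], visual_feat: Vector, weights: Matrix
) -> List[Vector]:
    """Add the projected visual vector to every token state.

    Hypothetical additive fusion for illustration only; it is not
    taken from the paper.
    """
    visual_proj = project(weights, visual_feat)
    return [[t + v for t, v in zip(state, visual_proj)] for state in token_states]


# Toy example: two tokens with 2-dim states and one 3-dim visual feature.
token_states = [[1.0, 0.0], [0.0, 1.0]]
visual_feat = [1.0, 2.0, 3.0]
# Made-up projection that keeps the first two visual dimensions.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

fused = additive_fusion(token_states, visual_feat, weights)
print(fused)
```

In a real system the projection would be learned and the visual feature would come from a pretrained image or video encoder. If that feature does not discriminate well between inputs, a fusion step like this adds little signal, which is consistent with the findings listed here.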
Findings
Marginal gains from visual context in large-scale datasets
Visual embeddings' discriminativeness is insufficient for improved translation
Quality of visual embeddings is crucial for effective multimodal NMT
Abstract
Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics. Recently, Caglayan et al. posited that the observed gains are limited mainly due to the very simple, short, repetitive sentences of the Multi30k dataset (the only multimodal MT dataset available at the time), which render the source text sufficient for context. In this work, we further investigate this hypothesis on a new large-scale multimodal Machine Translation (MMT) dataset, How2, which has 1.57 times longer mean sentence length than Multi30k and no repetition. We propose and evaluate three novel fusion techniques, each of which is designed to ensure the utilization of visual context at different stages of the Sequence-to-Sequence transduction pipeline, even under full linguistic context. However, we still obtain only marginal gains under full…
Taxonomy
Methods: Average Pooling · ResNeXt Block · Grouped Convolution · Global Average Pooling · Residual Connection · Kaiming Initialization · 1x1 Convolution · Convolution · Batch Normalization
