Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding

Dexin Wang; Deyi Xiong

arXiv:2101.05208·cs.CV·January 14, 2021·1 cites


TL;DR

This paper introduces an object-level visual context modeling framework for multimodal machine translation that improves grounding by masking irrelevant objects, leading to better translation performance.

Contribution

The paper proposes a novel object-level visual context modeling framework with masking and weighting strategies to enhance grounding in multimodal translation.

Findings

01 Outperforms state-of-the-art MMT models

02 Masking irrelevant objects improves grounding

03 Vision-weighted translation enhances accuracy

Abstract

Visual context provides grounding information for multimodal machine translation (MMT). However, previous MMT models and probing studies on visual features suggest that visual information is less explored in MMT as it is often redundant to textual information. In this paper, we propose an object-level visual context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation. With detected objects, the proposed OVC encourages MMT to ground translation on desirable visual objects by masking irrelevant objects in the visual modality. We equip the proposed OVC with an additional object-masking loss to achieve this goal. The object-masking loss is estimated according to the similarity between masked objects and the source texts so as to encourage masking source-irrelevant objects. Additionally, in order to generate vision-consistent target…
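The masking idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the cosine similarity measure, the fixed `threshold`, the zero-vector masking, and the function names `mask_irrelevant_objects` and `masking_penalty` are all assumptions made for illustration. The sketch scores each detected-object vector against a source-text vector, masks objects that score below the threshold, and computes a penalty that stays low when the masked objects are dissimilar to the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mask_irrelevant_objects(object_feats, text_vec, threshold=0.5):
    """Zero out object vectors whose similarity to the text vector is below threshold.

    Hypothetical helper: the threshold rule is an assumption, not the paper's method.
    Returns (masked_feats, similarities, keep_flags).
    """
    sims = [cosine(obj, text_vec) for obj in object_feats]
    keep = [s >= threshold for s in sims]
    masked = [list(obj) if k else [0.0] * len(obj)
              for obj, k in zip(object_feats, keep)]
    return masked, sims, keep


def masking_penalty(sims, keep):
    """Mean similarity of the masked objects to the text (assumed stand-in for a loss).

    The value is small when only source-irrelevant objects were masked.
    """
    masked_sims = [s for s, k in zip(sims, keep) if not k]
    if not masked_sims:
        return 0.0
    return sum(masked_sims) / len(masked_sims)


# Toy data: three 2-d "object" vectors and one "source text" vector.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [1.0, 0.0]
masked, sims, keep = mask_irrelevant_objects(objects, text)
penalty = masking_penalty(sims, keep)
```

In this toy case only the second object, which is orthogonal to the text vector, is masked. A trained model would learn the masking decision rather than apply a fixed threshold.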



Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling