Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding
Dexin Wang, Deyi Xiong

TL;DR
This paper introduces an object-level visual context modeling framework for multimodal machine translation that improves grounding by masking irrelevant objects, leading to better translation performance.
Contribution
The paper proposes a novel object-level visual context modeling framework with masking and weighting strategies to enhance grounding in multimodal translation.
Findings
Outperforms state-of-the-art MMT models
Masking irrelevant objects improves grounding
Vision-weighted translation enhances accuracy
Abstract
Visual context provides grounding information for multimodal machine translation (MMT). However, previous MMT models and probing studies on visual features suggest that visual information is less explored in MMT as it is often redundant to textual information. In this paper, we propose an object-level visual context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation. With detected objects, the proposed OVC encourages MMT to ground translation on desirable visual objects by masking irrelevant objects in the visual modality. We equip the proposed OVC with an additional object-masking loss to achieve this goal. The object-masking loss is estimated according to the similarity between masked objects and the source texts so as to encourage masking source-irrelevant objects. Additionally, in order to generate vision-consistent target…
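The object-masking idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the cosine similarity measure, the mean aggregation, and the names `cosine`, `mask_least_similar`, and `object_masking_loss` are all assumptions made for illustration. The sketch only shows how a loss based on the similarity between masked objects and the source text would be low when the masked objects are source-irrelevant.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mask_least_similar(object_feats, text_feat, k):
    """Hypothetical masking rule: mask the k objects least similar to the source text."""
    sims = [cosine(obj, text_feat) for obj in object_feats]
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    chosen = set(order[:k])
    return [i in chosen for i in range(len(sims))]

def object_masking_loss(object_feats, text_feat, mask):
    """Assumed loss: mean similarity between the masked objects and the source text.

    Masking source-relevant objects yields a high value, masking
    source-irrelevant objects yields a low one.
    """
    sims = [cosine(obj, text_feat) for obj, m in zip(object_feats, mask) if m]
    if not sims:
        return 0.0
    return sum(sims) / len(sims)

# Toy features: the first object aligns with the text, the second is orthogonal to it.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [1.0, 0.0]
mask = mask_least_similar(objects, text, k=1)
loss = object_masking_loss(objects, text, mask)
```

In this toy setup the orthogonal object is the one selected for masking, so the loss is at its minimum; masking the text-aligned object instead would raise it. A real model would use learned object and text representations rather than hand-written vectors.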
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
