Visual Agreement Regularized Training for Multi-Modal Machine Translation
Pengcheng Yang, Boxing Chen, Pei Zhang, Xu Sun

TL;DR
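This paper introduces visual agreement regularized training for multi-modal machine translation. The source-to-target and target-to-source translation models are trained jointly and encouraged to attend to the same visual information when generating semantically equivalent visual words, so that the paired image is used more effectively and translation accuracy improves.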
Contribution
It proposes a training approach that encourages the two translation directions to share the same focus on visual information, along with a multi-head co-attention model that captures interactions between visual and textual features.
Findings
Outperforms competitive baselines on the Multi30k dataset
Improves attention agreement on visual features
Enhances use of visual information in translation
Abstract
Multi-modal machine translation aims at translating the source sentence into a different language in the presence of the paired image. Previous work suggests that additional visual information only provides dispensable help to translation, which is needed in several very special cases such as translating ambiguous words. To make better use of visual information, this work presents visual agreement regularized training. The proposed approach jointly trains the source-to-target and target-to-source translation models and encourages them to share the same focus on the visual information when generating semantically equivalent visual words (e.g. "ball" in English and "ballon" in French). Besides, a simple yet effective multi-head co-attention model is also introduced to capture interactions between visual and textual features. The results show that our approaches can outperform competitive…
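The agreement idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each translation direction produces an attention distribution over the same set of image regions when it generates a visual word, and it penalizes disagreement between the two distributions for aligned word pairs. The squared-distance penalty, the weight `lam`, and all function names are assumptions made for illustration; the paper's actual regularizer may differ.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def agreement_penalty(attn_fwd, attn_bwd):
    """Squared L2 distance between two attention distributions.

    Both lists must cover the same image regions in the same order.
    The choice of distance is an assumption, not taken from the paper.
    """
    if len(attn_fwd) != len(attn_bwd):
        raise ValueError("attention vectors must cover the same regions")
    return sum((p - q) ** 2 for p, q in zip(attn_fwd, attn_bwd))


def regularized_loss(nll_fwd, nll_bwd, aligned_pairs, lam=0.1):
    """Joint loss: both translation losses plus a weighted agreement term.

    aligned_pairs holds (attn_fwd, attn_bwd) tuples, one per pair of
    semantically equivalent visual words (hypothetical input format).
    """
    reg = sum(agreement_penalty(a, b) for a, b in aligned_pairs)
    return nll_fwd + nll_bwd + lam * reg


# Example: attention over three image regions when one model generates
# "ball" and the other generates "ballon" (scores are made up).
attn_ball = softmax([2.0, 0.5, 0.1])
attn_ballon = softmax([1.8, 0.7, 0.1])
loss = regularized_loss(1.2, 1.4, [(attn_ball, attn_ballon)], lam=0.1)
```

When the two models attend to the same regions, the penalty is zero and the joint loss reduces to the sum of the two translation losses. The more their attention differs, the larger the added term.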
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
