Neural Machine Translation with Phrase-Level Universal Visual Representations
Qingkai Fang, Yang Feng

TL;DR
This paper introduces a phrase-level retrieval approach for multimodal machine translation that leverages existing sentence-image datasets to enhance translation quality without requiring paired data, using visual representations filtered by a variational auto-encoder.
Contribution
It proposes a novel phrase-level retrieval method combined with a conditional variational auto-encoder to improve multimodal translation by mitigating data sparsity and filtering redundant visual info.
Findings
Significant performance improvements over baselines on multiple datasets
Effective handling of limited textual context
Mitigation of data sparsity issues in multimodal translation
Abstract
Multimodal machine translation (MMT) aims to improve neural machine translation (NMT) with additional visual information, but most existing MMT methods require paired input of source sentence and image, which makes them suffer from shortage of sentence-image pairs. In this paper, we propose a phrase-level retrieval-based method for MMT to get visual information for the source input from existing sentence-image data sets so that MMT can break the limitation of paired sentence-image input. Our method performs retrieval at the phrase level and hence learns visual information from pairs of source phrase and grounded region, which can mitigate data sparsity. Furthermore, our method employs the conditional variational auto-encoder to learn visual representations which can filter redundant visual information and only retain visual information related to the phrase. Experiments show that the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
