GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation
Jiafeng Xiong, Yuting Zhao

TL;DR
GIIFT is a graph-guided framework for multimodal machine translation that learns from visual information during training and generalizes to image-free inference, achieving state-of-the-art results on Multi30K.
Contribution
It proposes a two-stage graph-guided inductive framework in which a cross-modal Graph Attention Network adapter learns multimodal knowledge from multimodal scene graphs in a unified fused space, then generalizes that knowledge to image-free translation.
Findings
Surpasses existing methods on the Multi30K English-to-French and English-to-German tasks.
Achieves state-of-the-art results without using images during inference.
Shows significant improvements on the WMT benchmark.
Abstract
Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without…
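To make the "Graph Attention Network" component named in the abstract concrete, here is a minimal sketch of a generic single-head graph attention layer in NumPy. It is not the authors' adapter: the toy four-node graph (three text nodes and one visual node), the feature sizes, the random weights, and the function name gat_layer are all assumptions chosen for illustration.

```python
import numpy as np

def gat_layer(h, adj, W, a, slope=0.2):
    """Generic single-head graph attention layer (illustrative sketch).

    h:   (N, F) node features
    adj: (N, N) 0/1 adjacency matrix
    W:   (F, D) shared linear projection
    a:   (2*D,) attention vector, split into source and target halves
    """
    n = h.shape[0]
    z = h @ W                                   # project every node: (N, D)
    d = z.shape[1]
    # Raw score e[i, j] = LeakyReLU(a_src . z_i + a_dst . z_j)
    e = (z @ a[:d])[:, None] + (z @ a[d:])[None, :]
    e = np.where(e > 0, e, slope * e)
    # Attend only to neighbours; self-loops keep every row non-empty
    mask = (adj + np.eye(n)) > 0
    e = np.where(mask, e, -np.inf)
    # Row-wise softmax over each node's neighbourhood
    e = e - e.max(axis=1, keepdims=True)
    w = np.exp(e)
    alpha = w / w.sum(axis=1, keepdims=True)
    return alpha @ z, alpha                     # aggregated features, weights

# Hypothetical toy graph: nodes 0-2 stand for text tokens, node 3 for a
# visual object that is linked to token 0.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [1, 0, 0, 0]], dtype=float)
W = rng.normal(size=(8, 5))
a = rng.normal(size=(10,))

out, alpha = gat_layer(h, adj, W, a)
print(out.shape)  # (4, 5)
```

Each row of alpha sums to 1 and is zero wherever two nodes are not connected, so a text node only mixes in visual features when the graph links them. In an image-free setting the same layer could run on a graph without visual nodes, though how GIIFT handles that case is not specified in the text above.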
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques
