CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation
Devaansh Gupta, Siddhant Kharbanda, Jiawei Zhou, Wanhua Li, Hanspeter Pfister, Donglai Wei

TL;DR
CLIPTrans leverages pre-trained multilingual and multimodal models to improve multimodal machine translation, achieving state-of-the-art results without complex new modules by aligning embedding spaces through a lightweight mapping.
Contribution
It introduces a simple adaptation method that aligns pre-trained models for effective multimodal translation, especially in low-resource language scenarios.
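As a rough illustration of what a "lightweight mapping" between embedding spaces can look like, the sketch below projects a single embedding vector into a short sequence of vectors in a second space. This is a minimal sketch under stated assumptions, not the authors' implementation: the purely linear form, the prefix layout, the function names (`make_mapping`, `map_to_prefix`) and all sizes are hypothetical, and the weights are random rather than trained.

```python
import random


def make_mapping(in_dim, out_dim, prefix_len, seed=0):
    """Randomly initialise a linear map (hypothetical; a real one would be trained).

    Returns a weight matrix with prefix_len * out_dim rows and in_dim columns.
    """
    rng = random.Random(seed)
    scale = in_dim ** -0.5  # keeps output magnitudes comparable to the input
    return [
        [rng.gauss(0.0, scale) for _ in range(in_dim)]
        for _ in range(prefix_len * out_dim)
    ]


def map_to_prefix(weights, embedding, out_dim):
    """Apply the linear map, then split the result into vectors of size out_dim."""
    flat = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    return [flat[i:i + out_dim] for i in range(0, len(flat), out_dim)]


# Assumed toy sizes; real embedding spaces are far larger.
IN_DIM, OUT_DIM, PREFIX_LEN = 8, 4, 3

weights = make_mapping(IN_DIM, OUT_DIM, PREFIX_LEN)
source_embedding = [0.1 * i for i in range(IN_DIM)]  # stand-in for an image or text embedding
prefix = map_to_prefix(weights, source_embedding, OUT_DIM)

print(len(prefix), len(prefix[0]))  # 3 4
```

The appeal of such a mapping is that only its small weight matrix needs training while the large pre-trained models on either side can stay largely untouched; how the mapping is trained and consumed in CLIPTrans itself is described in the paper, not here.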
Findings
Improves on the previous state of the art by an average of +2.67 BLEU across the evaluated benchmarks.
Effectively aligns multilingual and multimodal embeddings.
Demonstrates strong generalization in low-resource settings.
Abstract
There has been a growing interest in developing multimodal machine translation (MMT) systems that enhance neural machine translation (NMT) with visual knowledge. This problem setup involves using images as auxiliary information during training, and more recently, eliminating their use during inference. Towards this end, previous works face a challenge in training powerful MMT models from scratch due to the scarcity of annotated multilingual vision-language data, especially for low-resource languages. Simultaneously, there has been an influx of multilingual pre-trained models for NMT and multimodal pre-trained models for vision-language tasks, primarily in English, which have shown exceptional generalisation ability. However, these are not directly applicable to MMT since they do not provide aligned multimodal multilingual features for generative tasks. To alleviate this issue, instead…
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: ALIGN · mBART
