Bootstrapping Disjoint Datasets for Multilingual Multimodal Representation Learning
Ákos Kádár, Grzegorz Chrupała, Afra Alishahi, Desmond Elliott

TL;DR
This paper introduces a pseudopairing method that improves multilingual multimodal representation learning from disjoint datasets, narrowing the performance gap to training on aligned datasets without requiring external data.
Contribution
The paper proposes a novel pseudopairing approach that synthetically aligns disjoint multilingual image-caption datasets, enhancing retrieval performance.
Findings
Pseudopairing improves image–sentence retrieval performance over training on the disjoint data alone.
Generating the synthetic data with an external machine translation model yields better results than pseudopairing.
Disjoint datasets can be effectively aligned with synthetic pairing.
Abstract
Recent work has highlighted the advantage of jointly learning grounded sentence representations from multiple languages. However, the data used in these studies has been limited to an aligned scenario: the same images annotated with sentences in multiple languages. We focus on the more realistic disjoint scenario in which there is no overlap between the images in multilingual image–caption datasets. We confirm that training with aligned data results in better grounded sentence representations than training with disjoint data, as measured by image–sentence retrieval performance. In order to close this gap in performance, we propose a pseudopairing method to generate synthetically aligned English–German–image triplets from the disjoint sets. The method works by first training a model on the disjoint data, and then creating new triples across datasets using sentence similarity under…
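The pseudopairing step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' implementation: the function name `pseudopair`, the use of cosine similarity with a nearest-neighbour choice, the optional `min_similarity` filter, and the toy two-dimensional embeddings, which stand in for the output of a sentence encoder already trained on the disjoint data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pseudopair(english, german, min_similarity=0.0):
    """Build synthetic (English caption, German caption, image) triplets.

    `english` and `german` are lists of (caption, image_id, embedding)
    drawn from two datasets that share no images. For each English
    caption, the most similar German caption is attached to the English
    caption's image. `min_similarity` is a hypothetical filter that
    drops weak matches.
    """
    triplets = []
    for en_caption, en_image, en_vec in english:
        # Nearest German caption under the (assumed) similarity measure.
        de_caption, _, de_vec = max(german, key=lambda item: cosine(en_vec, item[2]))
        if cosine(en_vec, de_vec) >= min_similarity:
            triplets.append((en_caption, de_caption, en_image))
    return triplets


# Toy embeddings; a real setup would take them from the trained encoder.
english = [
    ("a dog runs on grass", "img_en_1", [0.9, 0.1]),
    ("a man rides a bike", "img_en_2", [0.1, 0.9]),
]
german = [
    ("ein Mann fährt Fahrrad", "img_de_1", [0.2, 0.8]),
    ("ein Hund läuft über eine Wiese", "img_de_2", [0.8, 0.2]),
]

triplets = pseudopair(english, german)
for triplet in triplets:
    print(triplet)
```

In this sketch the German images are never used when building triplets: only the caption crosses datasets, which is what makes the resulting triplets synthetic rather than genuinely aligned.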
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
