UNISON: Unpaired Cross-lingual Image Captioning
Jiahui Gao, Yi Zhou, Philip L. H. Yu, Shafiq Joty, Jiuxiang Gu

TL;DR
This paper introduces UNISON, an unpaired cross-lingual image captioning approach that generates captions in a target language without paired image-caption datasets, using scene graph encoding and cross-modal feature mapping.
Contribution
The work presents an unpaired cross-lingual image captioning method that relies on no caption corpus in either the source or the target language, which makes it easier to extend captioning to new languages.
Findings
Effective in Chinese image captioning
Outperforms existing methods in experiments
Utilizes scene graph and cross-modal mapping
Abstract
Image captioning has emerged as an interesting research field in recent years due to its broad application scenarios. The traditional paradigm of image captioning relies on paired image-caption datasets to train the model in a supervised manner. However, creating such paired datasets for every target language is prohibitively expensive, which hinders the extensibility of captioning technology and deprives a large part of the world population of its benefit. In this work, we present a novel unpaired cross-lingual method to generate image captions without relying on any caption corpus in the source or the target language. Specifically, our method consists of two phases: (i) a cross-lingual auto-encoding process, which utilizes a sentence-parallel (bitext) corpus to learn the mapping from the source to the target language in the scene graph encoding space and decode sentences in the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
