Improving Captioning for Low-Resource Languages by Cycle Consistency
Yike Wu, Shiwan Zhao, Jia Chen, Ying Zhang, Xiaojie Yuan, Zhong Su

TL;DR
This paper introduces a unified model that combines translation-based and alignment-based strategies under a cycle-consistency constraint to improve captioning in low-resource languages, using English caption datasets to improve both caption quality and word-region alignment.
Contribution
The paper proposes a novel architecture that integrates translation and alignment approaches with cycle consistency, enabling effective use of large English caption datasets for low-resource language captioning.
Findings
Outperforms state-of-the-art methods on standard metrics
Improves fine-grained word-region alignment
Effectively leverages monolingual English datasets
Abstract
Improving the captioning performance on low-resource languages by leveraging English caption datasets has received increasing research interest in recent years. Existing works mainly fall into two categories: translation-based and alignment-based approaches. In this paper, we propose to combine the merits of both approaches in one unified architecture. Specifically, we use a pre-trained English caption model to generate high-quality English captions, and then take both the image and generated English captions to generate low-resource language captions. We improve the captioning performance by adding the cycle consistency constraint on the cycle of image regions, English words, and low-resource language words. Moreover, our architecture has a flexible design which enables it to benefit from large monolingual English caption datasets. Experimental results demonstrate that our approach…
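The cycle over image regions, English words, and low-resource language words can be illustrated with a small sketch. The abstract does not give the loss, so everything here is an assumption made for illustration: the function names, the three row-stochastic attention matrices, and the choice to penalize the negative log-probability of returning to the starting region are all hypothetical and may differ from the authors' formulation.

```python
import math


def softmax_rows(scores):
    """Row-wise softmax: turns raw scores into attention distributions."""
    out = []
    for row in scores:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cycle_consistency_loss(region_to_en, en_to_tgt, tgt_to_region):
    """Hypothetical cycle loss (an assumption, not the paper's definition).

    Chains the three attention maps region -> English word -> target word
    -> region and returns the mean negative log-probability that each
    region is mapped back to itself.
    """
    cycle = matmul(matmul(region_to_en, en_to_tgt), tgt_to_region)
    n = len(cycle)
    return -sum(math.log(cycle[i][i] + 1e-12) for i in range(n)) / n


# Toy usage: 2 regions, 2 English words, 2 target-language words.
region_to_en = softmax_rows([[4.0, 0.0], [0.0, 4.0]])
en_to_tgt = softmax_rows([[4.0, 0.0], [0.0, 4.0]])
tgt_to_region = softmax_rows([[4.0, 0.0], [0.0, 4.0]])
loss = cycle_consistency_loss(region_to_en, en_to_tgt, tgt_to_region)
```

Under this sketch, sharply peaked and mutually consistent attention maps drive the loss toward zero, while diffuse or inconsistent maps increase it, which is the intuition behind using the cycle as a training constraint.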
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
