UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese
Doanh C. Bui, Nghia Hieu Nguyen, Khang Nguyen

TL;DR
This paper introduces UIT-OpenViIC, a new Vietnamese image captioning dataset that challenges current models and provides a benchmark for Vietnamese vision-language research, along with a novel captioning approach that improves caption quality.
Contribution
The paper presents a new Vietnamese image captioning dataset and a multi-level encoder fusion method that enhances captioning performance in Vietnamese.
Findings
The dataset is challenging for state-of-the-art models trained on MS-COCO.
The proposed CAMO approach, which fuses multi-level encoder outputs, improves caption quality over previous models.
UIT-OpenViIC can serve as a standard benchmark for Vietnamese image captioning.
Abstract
Image captioning is one of the vision-language tasks that continues to interest the research community worldwide in the 2020s. The MS-COCO Caption benchmark is commonly used to evaluate advanced captioning models, although it was published in 2015. Recent captioning models trained on the MS-COCO Caption dataset perform well only on English language patterns; they perform less well on scenes captured in Vietnam and do not caption images fluently in Vietnamese. To contribute to low-resource research communities such as Vietnam's, we introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC). The dataset includes complex scenes captured in Vietnam, manually annotated by Vietnamese annotators under strict rules and supervision. In this paper, we present in more detail the dataset creation…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
