SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning

Si-Woo Kim; MinJu Jeon; Ye-Chan Kim; Soeun Lee; Taewhan Kim; Dong-Jin Kim

arXiv:2507.18616 · cs.CV · July 25, 2025

TL;DR

SynC is a novel framework that refines synthetic image-caption datasets for zero-shot image captioning by reassigning captions to semantically aligned images using a one-to-many retrieval approach, improving model performance.

Contribution

SynC introduces a new dataset refinement method that reassigns captions to existing images through a one-to-many retrieval strategy, addressing semantic misalignments in synthetic data for ZIC.
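
As a rough illustration of the reassignment idea described above, the following is a minimal sketch and not the authors' implementation. It assumes captions and images have already been embedded into a shared vector space by some vision-language encoder (not shown), and it uses plain cosine similarity for both retrieval and final selection. The names `retrieve_candidates`, `reassign`, `k`, and `score_fn` are hypothetical and introduced only for this example; the paper's actual retrieval and selection criteria may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_candidates(caption_emb, image_embs, k):
    """One-to-many step: return indices of the k images most similar to one caption."""
    ranked = sorted(
        range(len(image_embs)),
        key=lambda i: cosine(caption_emb, image_embs[i]),
        reverse=True,
    )
    return ranked[:k]


def reassign(caption_embs, image_embs, k=2, score_fn=cosine):
    """Map each caption index to the best-scoring image among its k candidates.

    score_fn is a placeholder (assumption): any alignment score could be
    plugged in here; cosine similarity is used only to keep the sketch small.
    """
    assignment = {}
    for c_idx, c_emb in enumerate(caption_embs):
        candidates = retrieve_candidates(c_emb, image_embs, k)
        assignment[c_idx] = max(candidates, key=lambda i: score_fn(c_emb, image_embs[i]))
    return assignment


# Toy data (made up): image i was generated from caption i, but each
# generated image happens to match the other caption better.
caption_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.0, 1.0], [1.0, 0.0]]

assignment = reassign(caption_embs, image_embs, k=2)
print(assignment)
```

In this toy setup each caption ends up paired with the image that was originally generated for the other caption, which is the kind of repair that reassignment allows and that filtering alone (dropping both noisy pairs) would not.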

Findings

01. Significantly improves ZIC model performance on benchmarks

02. Achieves state-of-the-art results in multiple scenarios

03. Effectively refines synthetic datasets for better caption-image alignment

Abstract

Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques