CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

Marios Krestenitis; Christos Tzelepis; Konstantinos Ioannidis; Stefanos Vrochidis; Ioannis Kompatsiaris; Georgios Tzimiropoulos; Shaogang Gong; Ioannis Patras

arXiv:2603.18282 · cs.CV · March 23, 2026

TL;DR

CycleCap introduces a self-supervised cycle consistency fine-tuning method for visual-language models, significantly enhancing image captioning accuracy and grounding without requiring large annotated datasets.

Contribution

The paper proposes CycleCap, a novel self-supervised fine-tuning approach leveraging cycle consistency with pre-trained models, reducing reliance on annotated datasets and improving caption quality.

Findings

1. Consistent improvements across multiple VLMs and benchmarks.

2. Outperforms state-of-the-art supervised cycle consistency methods.

3. Enhances grounding and reduces hallucinations in image captioning.

Abstract

Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this either via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a…
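The cycle described in the abstract (image to caption, caption back to image, then compare with the original) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the paper's implementation: images are toy three-value vectors, `toy_specific_captioner`, `toy_generic_captioner`, and `toy_generator` replace the VLM and the text-to-image model, and cosine similarity is an assumed choice of image similarity. The excerpt above does not say which similarity measure is used or how the score drives fine-tuning, so the sketch stops at computing the score.

```python
import math
from typing import Callable, Sequence

# Toy stand-in: an "image" is a flat vector of (red, green, blue) intensities.
Image = Sequence[float]


def cosine_similarity(a: Image, b: Image) -> float:
    """Assumed similarity measure; the source does not specify one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cycle_consistency_score(
    image: Image,
    image_to_text: Callable[[Image], str],
    text_to_image: Callable[[str], Image],
    similarity: Callable[[Image, Image], float] = cosine_similarity,
) -> float:
    """Run image -> caption -> reconstructed image and compare to the original."""
    caption = image_to_text(image)           # forward mapping (the VLM's role)
    reconstruction = text_to_image(caption)  # backward mapping (text-to-image model's role)
    return similarity(image, reconstruction)


# --- Hypothetical toy models, for illustration only ---
COLOR_NAMES = ["red", "green", "blue"]


def toy_specific_captioner(image: Image) -> str:
    """Names the dominant channel, i.e. a caption grounded in the image."""
    dominant = max(range(len(image)), key=lambda i: image[i])
    return COLOR_NAMES[dominant]


def toy_generic_captioner(image: Image) -> str:
    """Ignores the image content, i.e. an overly generic caption."""
    return "an image"


def toy_generator(caption: str) -> Image:
    """Maps a known color word to a pure color, anything else to flat gray."""
    palette = {"red": [1.0, 0.0, 0.0], "green": [0.0, 1.0, 0.0], "blue": [0.0, 0.0, 1.0]}
    return palette.get(caption, [1.0, 1.0, 1.0])


mostly_red = [0.9, 0.1, 0.1]
specific_score = cycle_consistency_score(mostly_red, toy_specific_captioner, toy_generator)
generic_score = cycle_consistency_score(mostly_red, toy_generic_captioner, toy_generator)
print(f"specific caption: {specific_score:.3f}, generic caption: {generic_score:.3f}")
```

In this toy setting the grounded caption reconstructs an image closer to the original than the generic one does, which is the intuition behind using the reconstruction as a signal of caption quality without human annotations.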

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling