Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Hyojin Bahng; Caroline Chan; Fredo Durand; Phillip Isola

arXiv:2506.02095 · cs.CV · November 4, 2025

Open Access · 4 Models · 3 Datasets

TL;DR

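This paper uses cycle consistency as a supervisory signal for image-text alignment, removing the need for human preference annotations: a candidate caption (or image) is scored by how closely the original input can be reconstructed from it. The resulting reward model, CycleReward, is reported to outperform state-of-the-art alignment metrics on captioning, and the accompanying preference dataset improves performance on vision-language and text-to-image tasks.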

Contribution

The authors propose CycleReward, a reward model trained on preferences derived from cycle consistency rather than from human annotation, and release the models together with a preference dataset of 866K comparison pairs.

Findings

01. CycleReward outperforms state-of-the-art metrics on captioning tasks.

02. The method improves inference scalability and speed.

03. Using the dataset enhances performance in vision-language and text-to-image tasks.

Abstract

Measuring alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset, CycleReward, outperforms state-of-the-art alignment metrics on detailed…
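
The scoring and ranking steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `backward` stands in for whatever model maps a candidate back to the input space (for example a text-to-image model), `similarity` stands in for the image or text similarity measure, and the toy helpers `toy_text_to_image` and `jaccard` are hypothetical placeholders that treat an "image" as a set of words. Skipping tied scores when forming pairs is also an assumption.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence, Tuple, TypeVar

X = TypeVar("X")  # input modality, e.g. an image
Y = TypeVar("Y")  # candidate modality, e.g. a generated caption


def cycle_scores(
    x: X,
    candidates: Sequence[Y],
    backward: Callable[[Y], X],
    similarity: Callable[[X, X], float],
) -> List[float]:
    """Score each candidate by how well x is reconstructed from it."""
    return [similarity(x, backward(y)) for y in candidates]


def preference_pairs(
    x: X,
    candidates: Sequence[Y],
    backward: Callable[[Y], X],
    similarity: Callable[[X, X], float],
) -> List[Tuple[Y, Y]]:
    """Turn cycle-consistency scores into (preferred, rejected) pairs."""
    scores = cycle_scores(x, candidates, backward, similarity)
    pairs: List[Tuple[Y, Y]] = []
    for i, j in combinations(range(len(candidates)), 2):
        if scores[i] == scores[j]:
            continue  # assumption: a tie expresses no preference
        win, lose = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append((candidates[win], candidates[lose]))
    return pairs


# --- Hypothetical toy stand-ins, used only to make the sketch runnable ---

def toy_text_to_image(caption: str) -> FrozenSet[str]:
    """Pretend text-to-image model: the 'image' is the caption's word set."""
    return frozenset(caption.split())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Pretend image similarity: Jaccard overlap of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    image = frozenset({"red", "car", "street"})
    captions = ["red car", "a dog", "red car street"]
    for preferred, rejected in preference_pairs(
        image, captions, toy_text_to_image, jaccard
    ):
        print(f"{preferred!r} preferred over {rejected!r}")
```

The text-to-image direction mentioned in the abstract fits the same shape: `x` becomes the input caption, the candidates are generated images, `backward` is a captioning model, and `similarity` compares two texts.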

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques