CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models

Xiaojun Shan; Haoyu Shen; Yucheng Mao; Xiang Zhang; Abhay Anand; Bingnan Li; Haiyang Xu; Zhuowen Tu

arXiv:2603.14957 · cs.CV · March 17, 2026

TL;DR

CyCLeGen is a unified vision-language model that integrates image understanding and generation through cycle-consistent learning, improving reasoning and data efficiency, and achieving strong results across multiple benchmarks.

Contribution

It introduces a fully integrated autoregressive model with cycle consistency for joint image understanding and generation, enabling self-reflection and data-efficient learning.

Findings

01 Significant gains across diverse image understanding and generation benchmarks

02 Enhanced reasoning capabilities through cycle consistency

03 Effective self-supervised learning via synthetic supervision

Abstract

We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image->layout->image and layout->image->layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.
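
To make the layout->image->layout loop concrete, here is a minimal sketch of how a cycle-consistency score could be computed and used as a reward signal. It is not the paper's implementation: the abstract does not specify the layout format or the reward, so the (label, box) representation, the greedy same-label matching, the IoU-based score, and the names `box_iou` and `layout_cycle_reward` are all assumptions made for illustration. The generator and the layout predictor are left out; the function only compares the layout that went in with the layout that came back.

```python
from typing import List, Tuple

# Assumed layout format (hypothetical): a list of (label, box) pairs,
# where box = (x0, y0, x1, y1) in any consistent coordinate system.
Box = Tuple[float, float, float, float]
Layout = List[Tuple[str, Box]]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layout_cycle_reward(original: Layout, recovered: Layout) -> float:
    """Score in [0, 1] for how well `recovered` reproduces `original`.

    Each original object is greedily matched to the unused recovered
    object with the same label and the highest IoU. Dividing by the
    larger object count penalises both missing and extra objects.
    """
    if not original and not recovered:
        return 1.0  # two empty layouts agree trivially
    unused = list(recovered)
    total = 0.0
    for label, box in original:
        best_idx, best_iou = -1, 0.0
        for i, (r_label, r_box) in enumerate(unused):
            if r_label != label:
                continue
            iou = box_iou(box, r_box)
            if iou > best_iou:
                best_idx, best_iou = i, iou
        if best_idx >= 0:
            unused.pop(best_idx)  # each recovered box is used at most once
            total += best_iou
    return total / max(len(original), len(recovered))
```

In a training loop of the kind the abstract describes, a score like this would be computed between a sampled layout and the layout the model predicts from its own generated image, then fed to the reinforcement learning objective. The image->layout->image direction would need an image-level similarity measure instead, which this sketch does not cover.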

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning