X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

Zigang Geng; Yibing Wang; Yeyao Ma; Chen Li; Yongming Rao; Shuyang Gu; Zhao Zhong; Qinglin Lu; Han Hu; Xiaosong Zhang; Linus; Di Wang; Jie Jiang

arXiv:2507.22058·cs.CV·July 30, 2025

TL;DR

X-Omni applies reinforcement learning to discrete autoregressive image generation, reporting state-of-the-art image quality and instruction following from a single model that generates both language and images.

Contribution

The paper presents a novel reinforcement learning framework that enhances discrete autoregressive models for image generation, enabling better quality and instruction adherence.

Findings

01. Achieves state-of-the-art image generation quality.
02. Demonstrates strong instruction-following capabilities.
03. Produces detailed images with high aesthetic quality.

Abstract

Numerous efforts have been made to extend the "next-token prediction" paradigm to visual content, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributable to cumulative errors during autoregressive inference or to information loss incurred during discretization. Likely because of these challenges, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can…

Code & Models

Models

Datasets

X-Omni/LongText-Bench
dataset · 184 downloads
