AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

Dongjie Cheng; Ruifeng Yuan; Yongqi Li; Runyang You; Wenjie Wang; Liqiang Nie; Lei Zhang; Wenjie Li

arXiv:2601.17761·cs.LG·January 27, 2026

Open Access · 2 Models · 1 Dataset

TL;DR

AR-Omni introduces a unified autoregressive model capable of any-to-any multimodal generation, supporting text, image, and speech outputs within a single Transformer decoder, simplifying multimodal AI systems.

Contribution

The paper presents AR-Omni, the first unified autoregressive model that handles multiple modalities without expert decoders, improving the simplicity and scalability of multimodal generation.

Findings

01. Achieves high-quality multimodal generation across text, image, and speech.

02. Generates speech in real time, with a real-time factor of 0.88.

03. Addresses practical challenges such as modality imbalance and visual fidelity.
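
The 0.88 figure in finding 02 is easier to read with the arithmetic spelled out. The sketch below assumes the usual definition of real-time factor (time spent generating divided by the duration of the audio produced), which this page does not state explicitly; the function name and the timing values are hypothetical and chosen only to reproduce the reported number.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by the duration of the generated audio.

    Assumption: this is the conventional definition of real-time factor;
    the page itself does not define the metric.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Hypothetical timings (not measurements from the paper): 8.8 s of compute
# to produce 10 s of speech.
rtf = real_time_factor(generation_seconds=8.8, audio_seconds=10.0)

# A factor below 1.0 means audio is produced faster than it plays back,
# which is the precondition for streaming without gaps.
is_real_time = rtf < 1.0
```

Under this definition, a factor of 0.88 means each second of speech takes about 0.88 seconds to generate.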

Abstract

Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of "Omni" MLLMs that support both multimodal inputs and multimodal outputs. While a sequence of omni MLLMs has emerged, most existing systems still rely on additional expert components to achieve multimodal generation, limiting the simplicity of unified training and inference. Autoregressive (AR) modeling, with a single token stream, a single next-token objective, and a single decoder, is an elegant and scalable foundation in the text domain. Motivated by this, we present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We further address three…
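
The "single token stream, single next-token objective" idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the codebook sizes, the ID-offset scheme, and the helper names are hypothetical, and a uniform predictor stands in for the Transformer decoder.

```python
import math

# Hypothetical per-modality codebook sizes (not taken from the paper).
VOCAB_SIZES = {"text": 8, "image": 4, "speech": 4}

# Give each modality a disjoint ID range inside one shared vocabulary.
OFFSETS = {}
_next_id = 0
for _name, _size in VOCAB_SIZES.items():
    OFFSETS[_name] = _next_id
    _next_id += _size
SHARED_VOCAB = _next_id  # total number of token IDs across modalities


def to_shared(modality, local_ids):
    """Map modality-local token IDs into the shared vocabulary."""
    size = VOCAB_SIZES[modality]
    if any(not 0 <= i < size for i in local_ids):
        raise ValueError(f"token id out of range for {modality}")
    return [OFFSETS[modality] + i for i in local_ids]


def build_stream(segments):
    """Concatenate (modality, local_ids) segments into one token stream."""
    stream = []
    for modality, local_ids in segments:
        stream.extend(to_shared(modality, local_ids))
    return stream


def next_token_pairs(stream):
    """Pair every token with its successor, regardless of modality."""
    return list(zip(stream[:-1], stream[1:]))


def uniform_nll(stream):
    """Average next-token loss of a stand-in model that predicts uniformly.

    A real decoder would replace the uniform probability with its own
    prediction; the objective keeps the same form for every modality.
    """
    pairs = next_token_pairs(stream)
    return sum(-math.log(1.0 / SHARED_VOCAB) for _ in pairs) / len(pairs)


stream = build_stream([("text", [1, 2]), ("image", [0, 3]), ("speech", [2])])
pairs = next_token_pairs(stream)
loss = uniform_nll(stream)
```

The point of the sketch is that once every modality is mapped into one vocabulary, the same (context, next token) objective applies at modality boundaries as well as inside a modality, so no separate decoder is needed per output type.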

Code & Models

Models

Datasets

ModalityDance/AR-Omni-Instruct-v0.1
dataset · 28 downloads

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications