A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction

Jie Zhu; Hanghang Ma; Jia Wang; Yayong Guan; Yanbing Zeng; Lishuai Gao; Junqiang Wu; Jie Hu; Leye Wang

arXiv:2603.04980·cs.CV·March 6, 2026

TL;DR

Wallaroo is a straightforward autoregressive model that unifies understanding, generation, and editing across multiple modalities, languages, and resolutions, demonstrating competitive performance on various benchmarks.

Contribution

The paper introduces Wallaroo, a simple baseline that uses next-token prediction for multi-modal understanding, image generation, and editing, with bilingual (Chinese and English) and multi-resolution support.

Findings

01 Competitive performance on multiple benchmarks
02 Supports multi-resolution image input/output
03 Bilingual (Chinese and English) capabilities

Abstract

In this work, we introduce Wallaroo, a simple autoregressive baseline that uses next-token prediction to unify multi-modal understanding, image generation, and image editing in a single model. Wallaroo also supports multi-resolution image input and output, and handles both Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to shape the model's capabilities. Across various benchmarks, Wallaroo performs competitively with, or exceeds, other unified models, suggesting the potential of autoregressive models for unifying multi-modal understanding and generation. Our code is available at https://github.com/JiePKU/Wallaroo.
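The next-token prediction objective named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the shared vocabulary layout (text ids 0-3, image ids 4-7), the toy sequence, and the helper names `shift_for_next_token` and `next_token_nll` are all assumptions made for illustration, and the abstract does not say how Wallaroo tokenizes images. The sketch only shows the generic idea that, once text and image content sit in one token sequence, a single cross-entropy loss over the next token covers both.

```python
import math

def shift_for_next_token(sequence):
    """Split one token sequence into (inputs, targets) offset by one position."""
    return sequence[:-1], sequence[1:]

def next_token_nll(logits, targets):
    """Mean negative log-likelihood of targets[t] under softmax(logits[t])."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(targets)

# Hypothetical shared vocabulary: ids 0-3 stand for text tokens,
# ids 4-7 stand for image tokens (an assumption, not from the paper).
vocab_size = 8
sequence = [0, 2, 5, 7, 4]  # toy interleaved text + image sequence
inputs, targets = shift_for_next_token(sequence)

# Stand-in for model outputs: uniform logits at every position.
uniform_logits = [[0.0] * vocab_size for _ in inputs]
loss = next_token_nll(uniform_logits, targets)
```

With uniform logits the loss equals the log of the vocabulary size, which is the value an untrained model would start near; a trained model would lower it on both the text and the image positions with the same objective.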

Code & Models

Models

Hugging Face: jiezhueval/Wallaroo (model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques