VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

Maitreya Patel; Jingtao Li; Weiming Zhuang; Yezhou Yang; Lingjuan Lv

arXiv:2604.24885·cs.CV·April 29, 2026

TL;DR

VibeToken introduces a resolution-agnostic 1D Transformer image tokenizer enabling efficient, flexible autoregressive image synthesis across arbitrary resolutions with significantly reduced computational costs.

Contribution

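The paper presents VibeToken, a novel 1D Transformer-based image tokenizer that generalizes to arbitrary resolutions and aspect ratios, and VibeToken-Gen, a class-conditioned AR generator built on it that supports arbitrary resolutions out of the box while requiring significantly fewer compute resources.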

Findings

01. VibeToken-Gen synthesizes 1024x1024 images with only 64 tokens and 3.94 gFID.

02. VibeToken-Gen maintains constant FLOPs regardless of resolution, unlike fixed-resolution models.

03. VibeToken achieves state-of-the-art efficiency and performance trade-offs in image synthesis.

Abstract

We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen --…
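To make the token-budget figures in the abstract concrete, the following is a rough arithmetic sketch, not the authors' code. It compares a fixed 64-token budget against the 1,024-token alternative quoted above, and against a hypothetical 2D grid tokenizer. The downsampling factor of 16 and all function names are assumptions introduced only for illustration. The only general fact relied on is that the number of pairwise interactions in self-attention grows with the square of the sequence length.

```python
def grid_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Tokens from a hypothetical 2D grid tokenizer: one token per patch.

    The downsample factor of 16 is an assumption, not a value from the paper.
    """
    return (height // downsample) * (width // downsample)


def pixels_per_token(height: int, width: int, num_tokens: int) -> float:
    """Average number of pixels each token has to represent."""
    return height * width / num_tokens


def attention_pair_ratio(long_seq: int, short_seq: int) -> float:
    """How many times more token pairs self-attention sees for the longer sequence."""
    return (long_seq / short_seq) ** 2


FIXED_BUDGET = 64  # token count quoted in the abstract for 1024x1024 synthesis

for side in (256, 512, 1024):
    grid = grid_token_count(side, side)
    print(
        f"{side}x{side}: grid tokenizer -> {grid} tokens, "
        f"fixed budget -> {FIXED_BUDGET} tokens "
        f"({pixels_per_token(side, side, FIXED_BUDGET):.0f} pixels per token)"
    )

# 1,024 tokens (the diffusion-based alternative in the abstract) versus 64 tokens
print(attention_pair_ratio(1024, FIXED_BUDGET))
```

Under these assumptions, a grid tokenizer's sequence length grows with image area, while a fixed 1D budget keeps the generator's sequence length unchanged as resolution increases. This is consistent with the constant-FLOPs finding listed above, though the sketch does not model the paper's actual FLOP accounting.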
