STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

Ying Shen; Tianrong Chen; Yuan Gao; Yizhe Zhang; Yuyang Wang; Miguel Ángel Bautista; Shuangfei Zhai; Joshua M. Susskind; Jiatao Gu

arXiv:2605.08029·cs.CV·May 11, 2026

TL;DR

STARFlow2 is a unified multimodal generation model that pairs autoregressive normalizing flows with a pretrained VLM under one shared causal mask, enabling cache-friendly interleaved text-image generation and understanding.

Contribution

The paper proposes the Pretzel architecture, which vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections. This addresses the structural mismatch in prior methods between causal text generation and iterative visual denoising.

Findings

01. Achieves strong performance on image generation benchmarks.

02. Demonstrates effective multimodal understanding capabilities.

03. Enables cache-friendly interleaved generation of text and images.

Abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers (sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs), making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE…
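To make the abstract's observation concrete, here is a minimal NumPy sketch of a generic autoregressive affine flow over a token sequence. It is not STARFlow2's architecture or code. The conditioner is a toy stand-in for a causal Transformer (a causal mean over earlier tokens followed by linear maps), and every name in it (`causal_context`, `shift_and_log_scale`, `W_mu`, `W_alpha`) is a hypothetical chosen for illustration. It shows the two properties the abstract relies on: the forward pass runs in parallel under a strictly causal mask, and the inverse (sampling) pass runs left to right while reusing a cached summary of earlier tokens, much as an LLM reuses its KV cache.

```python
import numpy as np

def causal_context(x):
    """Mean of strictly earlier tokens; row t depends only on x[:t]."""
    T = x.shape[0]
    mask = np.tril(np.ones((T, T)), k=-1)            # strictly causal mask
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)
    return (mask @ x) / counts                        # row 0 is all zeros

def shift_and_log_scale(ctx, W_mu, W_alpha):
    """Toy conditioner (assumption): linear shift, tanh-bounded log-scale."""
    mu = ctx @ W_mu
    alpha = np.tanh(ctx @ W_alpha)
    return mu, alpha

def forward(x, W_mu, W_alpha):
    """Data -> latent in one parallel pass, plus the exact log|det Jacobian|."""
    mu, alpha = shift_and_log_scale(causal_context(x), W_mu, W_alpha)
    z = (x - mu) * np.exp(-alpha)
    log_det = -alpha.sum()        # Jacobian is triangular in token order
    return z, log_det

def inverse(z, W_mu, W_alpha):
    """Latent -> data, one token at a time, left to right."""
    T, D = z.shape
    x = np.zeros_like(z)
    running_sum = np.zeros(D)     # cached state, playing the role of a KV cache
    for t in range(T):
        ctx_t = running_sum / max(t, 1)
        mu_t, alpha_t = shift_and_log_scale(ctx_t, W_mu, W_alpha)
        x[t] = z[t] * np.exp(alpha_t) + mu_t
        running_sum += x[t]       # update the cache with the new token
    return x

# Usage: round-trip a random sequence of 5 tokens with 4 channels each.
rng = np.random.default_rng(0)
W_mu = 0.3 * rng.normal(size=(4, 4))
W_alpha = 0.3 * rng.normal(size=(4, 4))
x = rng.normal(size=(5, 4))
z, log_det = forward(x, W_mu, W_alpha)
x_rec = inverse(z, W_mu, W_alpha)
print("round trip exact:", np.allclose(x, x_rec))
```

Because each latent token depends only on the current and earlier data tokens, the log-determinant reduces to a sum of per-token log-scales, and generation needs only the cached context rather than iterative denoising over the whole image.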
