STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu

TL;DR
STARFlow2 is a unified multimodal generation model that pairs autoregressive normalizing flows with a pretrained VLM under a shared causal structure, enabling efficient interleaved text-image generation and understanding.
Contribution
It proposes a novel architecture that integrates autoregressive normalizing flows with pretrained VLMs for true multimodal unification, addressing structural mismatches in prior methods.
Findings
Achieves strong performance on image generation benchmarks.
Demonstrates effective multimodal understanding capabilities.
Enables cache-friendly interleaved generation of text and images.
Abstract
Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers (sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs), making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE…
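The abstract's observation that an autoregressive normalizing flow has the same left-to-right structure as a language model can be illustrated with a toy example. The sketch below is not the STARFlow2 or TarFlow implementation: the running-mean conditioner, the weight matrices `W_mu` and `W_s`, and the function names are all assumptions, standing in for a causal Transformer. It shows only the general mechanism: each token is transformed by an affine map whose parameters depend on earlier tokens, so the density pass runs in parallel under a causal dependency, while sampling proceeds token by token and reuses a running state, much like a KV cache.

```python
import numpy as np

def forward(x, W_mu, W_s):
    """Data -> latent, all positions at once (toy causal conditioner)."""
    T, D = x.shape
    # Exclusive prefix sums: position t only sees x[:t], mimicking a causal mask.
    prefix = np.vstack([np.zeros((1, D)), np.cumsum(x, axis=0)[:-1]])
    counts = np.maximum(np.arange(T), 1)[:, None]
    h = prefix / counts                # running mean of previous tokens
    mu = h @ W_mu                      # shift predicted from the past
    s = np.tanh(h @ W_s)               # bounded log-scale predicted from the past
    z = (x - mu) * np.exp(-s)          # affine transform of the current token
    logdet = -s.sum()                  # Jacobian is block lower-triangular
    return z, logdet

def inverse(z, W_mu, W_s):
    """Latent -> data, one token at a time, reusing a running state."""
    T, D = z.shape
    x = np.zeros_like(z)
    running = np.zeros(D)              # plays the role of a cache over past tokens
    for t in range(T):
        h = running / max(t, 1)
        mu = h @ W_mu
        s = np.tanh(h @ W_s)
        x[t] = z[t] * np.exp(s) + mu   # invert the affine map for token t
        running += x[t]                # update the state with the new token
    return x

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
W_mu = 0.1 * rng.normal(size=(4, 4))
W_s = 0.1 * rng.normal(size=(4, 4))

z, logdet = forward(x, W_mu, W_s)
x_rec = inverse(z, W_mu, W_s)
```

In this toy setting the sampling loop has the same shape as autoregressive text decoding: each step reads a summary of the past, emits one token, and updates the summary.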
