STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning

Jie Qin; Jiancheng Huang; Limeng Qiao; Lin Ma

arXiv:2512.13752·cs.CV·December 17, 2025

TL;DR

STAR introduces a stacked, task-progressive autoregressive framework for unified multimodal learning, balancing understanding, generation, and editing tasks while achieving state-of-the-art results across multiple benchmarks.

Contribution

The paper proposes a task-progressive autoregressive scheme that freezes the base model's parameters and stacks new modules on top, extending multimodal capabilities without cross-task interference.
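The freeze-and-stack idea can be illustrated with a framework-free toy. The sketch below is not the authors' code: the scalar `Layer` class, the helper names, and the choice to initialize the stacked layers as copies of the top base layers are all assumptions made for illustration. It only shows that a training step which skips frozen layers leaves the base model's behavior untouched while the stacked layers learn a new task.

```python
import copy
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for one AR block: y = w * x + b on scalars."""
    w: float
    b: float
    trainable: bool = True

    def forward(self, x: float) -> float:
        return self.w * x + self.b


def freeze(layers):
    """Mark every layer as frozen so training steps skip it."""
    for layer in layers:
        layer.trainable = False


def stack_from_top(base, n_new):
    """Create n_new trainable layers with the same structure as the base.

    Copying the top base layers as initialization is an assumption here.
    """
    new_layers = [copy.deepcopy(layer) for layer in base[-n_new:]]
    for layer in new_layers:
        layer.trainable = True
    return new_layers


def forward(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x


def sgd_step(layers, x, target, lr=0.01):
    """One squared-error gradient step that updates trainable layers only."""
    inputs, h = [], x
    for layer in layers:          # forward pass, caching each layer's input
        inputs.append(h)
        h = layer.forward(h)
    grad = 2.0 * (h - target)     # dLoss / dOutput
    for layer, inp in zip(reversed(layers), reversed(inputs)):
        grad_w, grad_b, grad_in = grad * inp, grad, grad * layer.w
        if layer.trainable:       # frozen layers pass gradients through unchanged
            layer.w -= lr * grad_w
            layer.b -= lr * grad_b
        grad = grad_in


# Base model (stands in for the understanding stage), then frozen.
base = [Layer(0.9, 0.1), Layer(1.1, -0.2), Layer(1.0, 0.0)]
freeze(base)
snapshot = copy.deepcopy(base)

# Next stage: stack two new layers and train only those.
stacked = stack_from_top(base, n_new=2)
model = base + stacked
for _ in range(50):
    sgd_step(model, x=1.0, target=2.0)
```

After training, `base` still equals `snapshot`, so anything computed from the base layers alone is unchanged, while `stacked` has moved toward the new target.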

Findings

1. Achieves state-of-the-art on GenEval (0.91)
2. Reaches 87.44 on DPG-Bench
3. Scores 4.34 on the ImgEdit image-editing benchmark

Abstract

Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified target for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an…
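For readers unfamiliar with the "VQ" mentioned in the abstract, the following is a minimal sketch of generic vector quantization, assuming a plain nearest-neighbor lookup under squared Euclidean distance. It is not the paper's high-capacity VQ, whose design is not described in this listing; the `quantize` function, the four-entry codebook, and the 2-D "patch" vectors are hypothetical.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
        for v in vectors
    ]


# Hypothetical 4-entry codebook and three 2-D feature vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patches = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.95)]

tokens = quantize(patches, codebook)           # -> [0, 3, 2]
reconstructed = [codebook[t] for t in tokens]  # discrete tokens back to vectors
```

The discrete `tokens` are what an autoregressive model would predict one at a time. In general, a codebook with more entries leaves a smaller gap between each input vector and its reconstruction.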

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The manuscript includes targeted ablations for stack depth, initialization strategy, VQ type, diffusion vs VQ decoder, input strategies for DiT conditioning, which help explain why each design choice was made and where gains come from.
2. Various experiments, comprehensive comparison with existing works.
3. A wide array of experiments is conducted to showcase the effectiveness of STAR.

Weaknesses

1. Table 2 and Table 3 are misleading, as bold text should highlight the best model. The authors may want to reconsider whether Table 1 is necessary, as the image understanding capabilities are entirely inherited from the frozen VLM.
2. There are a few typos: issues with plural and non-plural forms, and “autoregressive (AR)” should be used as an adjective, not a noun.
3. Key claimed novelties are combinations of known ingredients. While the paper proposes the “stacked isomorphic AR layers + ST

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Strong Empirical Performance: The paper provides comprehensive experimental validation. STAR achieves state-of-the-art (SOTA) results on multiple generation benchmarks, including GenEval, DPG-Bench, and ImgEdit. As shown in Table 1, the model also retains competitive performance on a wide array of understanding benchmarks (MMStar, SEED, MME, OCRBench), demonstrating that the generative extensions did not lead to a significant degradation of the base model's comprehension abilities.
2. High-C

Weaknesses

1. Limited Technical Novelty: The core technical proposal (stacking additional, isomorphic layers onto a frozen backbone) is a straightforward and well-established technique in transfer learning. While its application to unified MLLMs is shown to be effective, the underlying mechanism lacks significant technical novelty and could be viewed as an incremental engineering contribution rather than a fundamental advance in model architecture or training paradigms.
2. Unanalyzed Computational Cost and

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper proposes a novel stacked autoregressive (STAR) architecture for unified multimodal learning that allows progressive expansion from understanding to text-to-image generation and image editing without retraining or catastrophic forgetting. The idea of stacking isomorphic AR modules as “frozen base + appended heads” represents a creative rethinking of multimodal model scaling. The overall technical design is sound and well-motivated. The proposed STAR-VQ tokenizer, the modular training cu

Weaknesses

1. Missing comparison. Some methods [1,2,3] are missing from the experiments. Including such baselines would clarify STAR’s relative advantages and limitations.
2. Limited theoretical justification. The paper provides strong empirical validation but offers little theoretical analysis explaining why the stacked autoregressive (AR) expansion avoids optimization interference. A more formal discussion of gradient isolation or representational decoupling between frozen and stacked modules would strengthen

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning