STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
Jie Qin, Jiancheng Huang, Limeng Qiao, Lin Ma

TL;DR
STAR introduces a staged autoregressive framework for unified multimodal learning, effectively balancing understanding, generation, and editing tasks while achieving state-of-the-art results across multiple benchmarks.
Contribution
The paper proposes a task-progressive autoregressive scheme that freezes the base model's parameters and stacks new modules on top of it, extending multimodal capabilities without cross-task interference.
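The freeze-and-stack idea can be illustrated with a toy sketch. This is not the authors' implementation: the `Layer`, `StackedAR`, `freeze`, `stack`, and `sgd_step` names are hypothetical, each "layer" is a single scalar weight rather than a transformer block, and initializing the new layers as copies of the top base layers is an assumption made only for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Layer:
    """Toy stand-in for one AR block: a single scalar weight."""
    weight: float
    trainable: bool = True

    def forward(self, x):
        return x + self.weight * x  # residual-style update


@dataclass
class StackedAR:
    layers: list = field(default_factory=list)

    def forward(self, x, num_layers=None):
        # num_layers=None runs the full stack; an integer runs only a prefix.
        for layer in self.layers[:num_layers]:
            x = layer.forward(x)
        return x


def freeze(model):
    """Mark every existing layer as non-trainable."""
    for layer in model.layers:
        layer.trainable = False


def stack(model, depth):
    """Append `depth` trainable layers of the same type as the base layers.

    Assumption: new layers start as copies of the top `depth` base layers.
    """
    start = len(model.layers) - depth
    new_layers = [copy.deepcopy(layer) for layer in model.layers[start:]]
    for layer in new_layers:
        layer.trainable = True
    model.layers.extend(new_layers)
    return new_layers


def sgd_step(model, grads, lr=0.1):
    """Update trainable layers only; frozen layers are skipped."""
    for layer, grad in zip(model.layers, grads):
        if layer.trainable:
            layer.weight -= lr * grad


# Stage 1: a base model (understanding) that is then frozen.
model = StackedAR([Layer(0.1), Layer(0.2), Layer(0.3)])
freeze(model)
base_out_before = model.forward(1.0, num_layers=3)

# Stage 2: stack two new layers (generation) and train only those.
stack(model, depth=2)
sgd_step(model, grads=[1.0] * len(model.layers))
base_out_after = model.forward(1.0, num_layers=3)
```

Because the update skips frozen layers, the output of the base prefix is identical before and after the step, which is the property the scheme relies on to preserve existing capabilities.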
Findings
Achieves a state-of-the-art score of 0.91 on GenEval
Scores 87.44 on DPG-Bench
Scores 4.34 on image editing tasks
Abstract
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified target for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an…
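For readers unfamiliar with vector quantization (VQ), the following is a minimal generic sketch of how continuous image features become discrete tokens that an AR model can predict. It does not reproduce the paper's high-capacity VQ, whose details are not given in this summary; the `quantize` and `dequantize` helpers, the 2-D features, and the three-entry codebook are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


def dequantize(indices, codebook):
    """Recover the codebook entry for each token index."""
    return [codebook[i] for i in indices]


# Hypothetical 2-D codebook with three entries and three patch features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

tokens = quantize(patches, codebook)          # discrete token ids
reconstruction = dequantize(tokens, codebook)  # nearest codebook vectors
```

In this generic setting, adding entries to a codebook can only keep or reduce the nearest-entry distance for a given feature, which is one general reason a larger codebook can represent images at finer granularity.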
Peer Reviews
Decision · Submitted to ICLR 2026
1. The manuscript includes targeted ablations for stack depth, initialization strategy, VQ type, diffusion vs. VQ decoder, and input strategies for DiT conditioning, which help explain why each design choice was made and where the gains come from.
2. Various experiments and a comprehensive comparison with existing works.
3. A wide array of experiments is conducted to showcase the effectiveness of STAR.
1. Table 2 and Table 3 are misleading, as bold text should highlight the best model. The authors may want to reconsider whether Table 1 is necessary, as the image understanding capabilities are entirely inherited from the frozen VLM.
2. There are a few typos: issues with plural and non-plural forms, and “autoregressive (AR)” should be used as an adjective, not a noun.
3. Key claimed novelties are combinations of known ingredients. While the paper proposes the “stacked isomorphic AR layers + ST
1. Strong Empirical Performance: The paper provides comprehensive experimental validation. STAR achieves state-of-the-art (SOTA) results on multiple generation benchmarks, including GenEval, DPG-Bench, and ImgEdit. As shown in Table 1, the model also retains competitive performance on a wide array of understanding benchmarks (MMStar, SEED, MME, OCRBench), demonstrating that the generative extensions did not lead to a significant degradation of the base model's comprehension abilities.
2. High-C
1. Limited Technical Novelty: The core technical proposal, stacking additional isomorphic layers onto a frozen backbone, is a straightforward and well-established technique in transfer learning. While its application to unified MLLMs is shown to be effective, the underlying mechanism lacks significant technical novelty and could be viewed as an incremental engineering contribution rather than a fundamental advance in model architecture or training paradigms.
2. Unanalyzed Computational Cost and
The paper proposes a novel stacked autoregressive (STAR) architecture for unified multimodal learning that allows progressive expansion from understanding to text-to-image generation and image editing without retraining or catastrophic forgetting. The idea of stacking isomorphic AR modules as “frozen base + appended heads” represents a creative rethinking of multimodal model scaling. The overall technical design is sound and well-motivated. The proposed STAR-VQ tokenizer, the modular training cu
1. Missing comparison. Some methods [1,2,3] are missing from the experiments. Including these baselines would clarify STAR’s relative advantages and limitations.
2. Limited theoretical justification. The paper provides strong empirical validation but offers little theoretical analysis explaining why the stacked autoregressive (AR) expansion avoids optimization interference. A more formal discussion of gradient isolation or representational decoupling between frozen and stacked modules would strengthen
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
