Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang

TL;DR
Lumos-1 introduces a unified autoregressive video generation model based on large language models, utilizing novel frequency-aware positional encoding and a parallel discrete diffusion process to improve efficiency and quality.
Contribution
The paper proposes MM-RoPE for better spatiotemporal modeling and a parallel mask-based discrete diffusion method with autoregressive training, advancing unified LLM-based video generation.
Findings
Lumos-1 is competitive with existing models on T2I, T2V, and I2V generation benchmarks.
The proposed MM-RoPE effectively models spatiotemporal correlations in videos.
Autoregressive discrete diffusion improves generation quality with limited data.
Abstract
Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive (AR) video generation. Existing AR video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an LLM-based unified model for AR video generation with efficient discrete diffusion. Firstly, to fit videos with LLMs, we identify that 1D RoPE is ill-suited for visual spatiotemporal correlation modeling, and while demonstrated to be useful, naive 3D RoPE exhibits imbalanced frequency spectra. Therefore, we propose MM-RoPE, which preserves the original textual RoPE while seamlessly accommodating video data with comprehensive frequency spectra and scaled 3D positions. Secondly, to fit the video data's nature and overcome…
Peer Reviews
Decision·ICLR 2026 Poster
1. The study of RoPE frequency allocation is genuinely interesting. The proposed MM-RoPE makes sense: it fixes some imbalance in 3D RoPE and doesn’t really cost extra compute.
2. The setup is efficient: 60M images and 10M videos, trained on 48 H20 GPUs.
3. The experiments are complete, covering main benchmarks (T2I, T2V, I2V) and solid ablations.
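The frequency imbalance mentioned above can be illustrated with a small sketch. It is not the authors' MM-RoPE implementation. It assumes a toy head dimension of 96 (so the channel pairs split evenly three ways) and uses hypothetical helper names. It compares a contiguous split of RoPE channel pairs across time, height, and width with an interleaved split, and reports the frequency range each axis receives.

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE frequencies: one per channel pair, ordered high to low."""
    num_pairs = head_dim // 2
    return base ** (-np.arange(num_pairs) / num_pairs)

def contiguous_allocation(num_pairs: int) -> np.ndarray:
    """Naive 3D split: first block of pairs -> time, second -> height, third -> width."""
    # Axis ids: 0 = time, 1 = height, 2 = width.
    return np.repeat(np.arange(3), num_pairs // 3)

def interleaved_allocation(num_pairs: int) -> np.ndarray:
    """Hypothetical distributed split: pairs alternate t, h, w, t, h, w, ..."""
    return np.arange(num_pairs - num_pairs % 3) % 3

def frequency_range_per_axis(freqs: np.ndarray, axis_of_pair: np.ndarray) -> dict:
    """Return {axis_name: (min_freq, max_freq)} over the pairs assigned to each axis."""
    used = freqs[: len(axis_of_pair)]
    ranges = {}
    for axis, name in enumerate(("time", "height", "width")):
        selected = used[axis_of_pair == axis]
        ranges[name] = (float(selected.min()), float(selected.max()))
    return ranges

# Toy configuration (assumption): head_dim = 96 gives 48 channel pairs.
freqs = rope_frequencies(96)
naive = frequency_range_per_axis(freqs, contiguous_allocation(len(freqs)))
mixed = frequency_range_per_axis(freqs, interleaved_allocation(len(freqs)))

for name in ("time", "height", "width"):
    print(name, "contiguous:", naive[name], "interleaved:", mixed[name])
```

Under the contiguous split, the width axis only sees the lowest third of the spectrum, while the interleaved split gives every axis both high and low frequencies at the same total channel count, which is consistent with the remark that the fix costs no extra compute.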
1. The biggest weakness is the performance. Even though the paper claims efficiency, Lumos-1 doesn’t really beat diffusion or existing AR models on any benchmark. The visuals still look blurry and the motion often feels weird or distorted. It’s not convincing that this setup improves things beyond simplicity.
2. The experiments validate components such as AR-DF and MM-RoPE using only validation loss. Including benchmark metrics would better demonstrate effectiveness since validatio
- Lumos-1 demonstrates that a pure LLM architecture can be employed for autoregressive video generation.
- Lumos-1 addresses the frequency limitations of the original M-RoPE.
- Lumos-1 proposes the AR-DF scheme to mitigate the training-inference inconsistency.
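As a rough illustration of why a mask-based scheme such as AR-DF might share one mask pattern across frames, the sketch below compares independent per-frame masks with a single spatial mask repeated over time. It is an assumption-laden toy, not the paper's training code: the function names, grid sizes, and the `leak_fraction` measure are hypothetical. The measure counts how often a masked token's same-position token in the previous frame is still visible, which is one simple proxy for information leakage between frames.

```python
import numpy as np

def independent_frame_masks(num_frames, height, width, mask_ratio, rng):
    """Each frame draws its own random mask (True = token is masked)."""
    return rng.random((num_frames, height, width)) < mask_ratio

def shared_temporal_mask(num_frames, height, width, mask_ratio, rng):
    """One spatial mask is drawn once and repeated for every frame."""
    spatial = rng.random((height, width)) < mask_ratio
    return np.broadcast_to(spatial, (num_frames, height, width)).copy()

def leak_fraction(mask):
    """Fraction of masked tokens (frames 1..T-1) whose same-position
    token in the previous frame is visible."""
    masked_now = mask[1:]
    visible_before = ~mask[:-1]
    num_masked = masked_now.sum()
    if num_masked == 0:
        return 0.0
    return float((masked_now & visible_before).sum() / num_masked)

# Toy sizes (assumption): 8 frames of 16x16 tokens, half of them masked.
rng = np.random.default_rng(0)
independent = independent_frame_masks(8, 16, 16, 0.5, rng)
shared = shared_temporal_mask(8, 16, 16, 0.5, rng)

print("independent masks leak fraction:", leak_fraction(independent))
print("shared mask leak fraction:", leak_fraction(shared))
```

With a shared mask, a masked position is masked in every frame, so the proxy is zero by construction. With independent masks, roughly a `1 - mask_ratio` share of masked tokens can be read off the neighbouring frame.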
- **Reconsidering the Masking Strategy of AR-DF.** Diffusion forcing [1] models the conditional probabilities between frames by adding independent noise to each frame. However, due to the issue of information leakage in mask-based approaches, AR-DF can only apply the same noise to all frames, which deviates from the core idea of diffusion forcing. In my view, AR-DF is more like a form of data augmentation or an exploration of better masking strategies for video generation. I believe the authors n
1. This work proposes Lumos-1, which demonstrates the effectiveness of video generation using an LLM architecture, paving the way for a truly unified foundation model and eliminating the need for external text encoders.
2. It proposes MM-RoPE to address imbalanced frequency spectra in 3D RoPE, enhancing spatiotemporal correlation modeling via distributed channel allocation and scaled 3D positions.
3. Lumos-1 employs autoregressive discrete diffusion forcing (AR-DF) to mitigate frame-wise loss imb
1. Although structurally similar to Llama, Lumos-1 is trained from scratch, requiring simultaneous learning of language and vision, which may lead to training instability and inefficiency. The validation curve in Figure 7(a) supports concerns about the instability. Furthermore, for a foundation model, the full training cost, particularly in terms of time, is not disclosed, which raises concerns about training inefficiency.
2. The paper claims minimal structural modi
