Partition Generative Modeling: Masked Modeling Without Masks
Justin Deschenaux, Lan Tran, Caglar Gulcehre

TL;DR
Partition Generative Models (PGMs) replace masking with partitioning to enable parallel, any-order token generation while processing only relevant tokens during sampling, achieving higher throughput and comparable quality to existing models.
Contribution
PGMs introduce a novel partitioning approach that eliminates masks, combining the advantages of MGMs and ARMs for efficient, flexible generative modeling.
Findings
PGMs achieve 5-5.5x higher throughput than MDLM on OpenWebText.
PGMs produce lower perplexity samples than MDLM.
On ImageNet, PGMs reach comparable FID with 7.5x throughput improvement.
Abstract
Masked generative models (MGMs) can generate tokens in parallel and in any order, unlike autoregressive models (ARMs), which decode one token at a time, left-to-right. However, MGMs process the full-length sequence at every sampling step, including mask tokens that carry no information. In contrast, ARMs process only the previously generated tokens. We introduce "Partition Generative Models" (PGMs), which replace masking with partitioning. Tokens are split into two groups that cannot attend to each other, and the model learns to predict each group conditioned on the other, eliminating mask tokens entirely. Because the groups do not interact, PGMs can process only the clean tokens during sampling, like ARMs, while retaining parallel, any-order generation, like MGMs. On OpenWebText, PGMs achieve higher throughput than MDLM while producing samples with lower Generative…
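The abstract's statement that the two groups "cannot attend to each other" can be illustrated with a small sketch. This is not the authors' architecture: it is a single plain attention layer in NumPy, and the names `partition_attention_mask`, `blocked_attention`, and `group_ids` are hypothetical. It only shows one consequence of group-blocked attention: the outputs for one group are identical whether the other group's tokens are present or dropped, which is why processing only the clean tokens is possible.

```python
import numpy as np

def partition_attention_mask(group_ids):
    """allowed[i, j] is True when position i may attend to position j,
    i.e. when both positions belong to the same group (illustrative assumption)."""
    g = np.asarray(group_ids)
    return g[:, None] == g[None, :]

def blocked_attention(q, k, v, allowed):
    """Single-head softmax attention that ignores disallowed pairs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)       # block cross-group pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                          # blocked pairs get weight 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, dim = 6, 4
group_ids = np.array([0, 1, 0, 0, 1, 1])              # a fixed example partition
q, k, v = (rng.normal(size=(seq_len, dim)) for _ in range(3))

allowed = partition_attention_mask(group_ids)
out_full = blocked_attention(q, k, v, allowed)

# Run attention on group 0 alone, as if group 1 were never fed to the model.
idx0 = np.flatnonzero(group_ids == 0)
all_pairs = np.ones((idx0.size, idx0.size), dtype=bool)
out_sub = blocked_attention(q[idx0], k[idx0], v[idx0], all_pairs)

# Group 0's outputs do not depend on whether group 1 was present.
same = np.allclose(out_full[idx0], out_sub)
```

In this toy setting `same` is true because blocked pairs contribute zero attention weight, so dropping the other group changes nothing for the remaining rows. How the actual model lets each group condition its predictions on the other is not covered by this sketch.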
Peer Reviews
Decision: ICLR 2026 Oral
- The GroupSwap layer and partition-aware transformer structure are well-motivated.
- Includes analyses of perplexity, latency, throughput, and ablations on masking vs. partitioning.
- Strong empirical results across both text and image generation tasks: PGMs deliver substantial inference speedups (up to 7×) with little to no degradation in output quality.
- The architectural details (e.g., data-dependent vs. data-independent queries) are dense and could be clarified or simplified; the paper is a bit difficult to follow.
- The largest experiments are modest in size (268M parameters). It remains unclear whether PGMs scale favorably compared to state-of-the-art large AR or diffusion models.
- No comparison against recent SOTA non-autoregressive language models beyond MDLM.
1. The core idea of avoiding computation on masked tokens during inference, along with the corresponding training strategy, is interesting and effectively targets a key inefficiency in existing masked generative models.
2. The empirical results demonstrate that PGM can significantly accelerate inference while maintaining generation quality comparable to other state-of-the-art generative models, supporting the practical value of the proposed approach.
3. The paper is clearly written, well-struc
I did not identify any major weaknesses in this paper. I do, however, have one question for clarification: The proposed training pipeline includes two prediction components that operate on the same batch of data, which suggests that training efficiency could potentially be better than MDLM. Could the authors provide quantitative results or analysis regarding training efficiency, such as training speed, computational cost, or resource usage compared to MDLM?
- The empirical benefit is strong: 5x faster than MGM (4.6x faster with nucleus sampling).
- Complementary masking is a smart and original trick to let one training step effectively count as two steps.
- Section 5.3: fair comparison against MDLM (MGM) by isolating the complementary masking trick.
- The downstream tasks span both image and language, and the evaluation is solid. Distillation is also explored, which improves the practical significance of the paper.
- The fairness of Table 2's comparison is not immediately visible; I believe fairness should outweigh matching performance. Since the paper switches from a decoder-only to an encoder-decoder architecture, controlling hyperparameters (width, heads, depth, and MLP width multipliers) seems crucial for a fair comparison. In LM1B, controlling parameter counts and comparing with PGM (6/6) ~170M is a good idea, but in OWT that model is missing from the main text (only the dim. 1024 model is shown). I d
Taxonomy
Topics: 3D Modeling in Geospatial Applications
Methods: Softmax · Attention Is All You Need · Diffusion · Probability Guided Maxout
