Masked Generative Policy for Robotic Control
Lipeng Zhuang, Shiyu Fan, Florent P. Audonnet, Yingdong Ru, Edmond S. L. Ho, Gerardo Aragon Camarasa, Paul Henderson

TL;DR
The paper introduces Masked Generative Policy (MGP), a transformer-based framework for robotic control that improves success rates and inference speed in complex, non-Markovian tasks through novel masked token generation and refinement strategies.
Contribution
It proposes a new masked transformer approach for visuomotor imitation learning, enabling rapid, coherent, and adaptive control in complex robotic tasks, outperforming prior diffusion and autoregressive methods.
Findings
Achieves 9% higher success rate across 150 tasks.
Reduces inference time by up to 35x.
Solves non-Markovian scenarios where others fail.
Abstract
We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior…
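The generate-then-refine loop described in the abstract can be pictured with a small sketch. This is not the authors' implementation: `toy_logits` is a hypothetical stand-in for the conditional masked transformer, and using the maximum softmax probability as the confidence score, the `MASK` id, and the `remask_frac` and `n_refine` parameters are all assumptions made for illustration.

```python
import numpy as np

MASK = -1  # hypothetical id marking a masked position (assumption)


def toy_logits(tokens, obs, vocab_size, rng):
    """Hypothetical stand-in for the conditional masked transformer.

    A real model would attend to the unmasked tokens and the observation;
    this toy only uses the sequence length and adds `obs` as a bias.
    """
    return rng.normal(size=(len(tokens), vocab_size)) + obs


def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def masked_generate(obs, seq_len=8, vocab_size=16, n_refine=2,
                    remask_frac=0.25, seed=0):
    rng = np.random.default_rng(seed)

    # Pass 1: every position is masked, so all tokens are predicted in parallel.
    tokens = np.full(seq_len, MASK)
    probs = softmax(toy_logits(tokens, obs, vocab_size, rng))
    tokens = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)  # assumed confidence: max softmax probability

    # Refinement: re-mask only the least confident positions and re-predict
    # them, keeping the high-confidence tokens fixed.
    n_remask = max(1, int(remask_frac * seq_len))
    for _ in range(n_refine):
        low = np.argsort(conf)[:n_remask]
        tokens[low] = MASK
        probs = softmax(toy_logits(tokens, obs, vocab_size, rng))
        tokens[low] = probs[low].argmax(axis=-1)
        conf[low] = probs[low].max(axis=-1)
    return tokens, conf


tokens, conf = masked_generate(obs=np.zeros(16))
```

In this sketch each refinement pass touches only `n_remask` positions, which is the intuition behind the claimed speed-up over methods that resample the whole sequence at every step.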
Peer Reviews
Decision: ICLR 2026 Poster
The authors conducted a thorough analysis of current action generation methods and proposed MGP to address the latency issues inherent in diffusion-style or autoregressive-style action generation. The paper is clearly articulated and easy to follow. The concept of using MGP to re-predict tokens with low confidence while maintaining those with high confidence is intriguing. Theoretically, this approach could indeed reduce the time consumed in predicting actions.
1. I acknowledge that the results in the simulated environment are impressive. However, given the sim-to-real gap, work in this field usually needs to demonstrate effectiveness in real-world settings.
2. Regarding the confidence score: could you analyze the situations that might lead to a lower confidence score? Additionally, how can we ensure the accuracy of the confidence score itself?
3. About the MGP-Long settings: in long sequences, certain objects may cause environmental changes
- Unlike diffusion-based policies, which might require external distillation for fast inference, MGP relies less on iterative sampling to obtain clean actions, and its proposed sampling strategies allow flexible test-time adjustment.
- MGP-Long iteratively refines the action tokens using the executed actions along with the updated observation to improve trajectory-level coherence, which achieves strong performance in non-Markovian and dynamic environments, and remains robu
- Baselines such as diffusion-based policies (e.g. ) as well as VQ-BeT stand out when learning multimodal action distributions. While MGP is also built on top of vector quantization, it is not yet clear how the proposed sampling methods work on tasks with explicit multimodality.
- As all tokens are predicted in parallel, the refinement process can be affected if low-quality actions are initially predicted with high confidence, causing error accumulation throughout the following iterations.
Original idea: creatively transfers masked-generation paradigms (MaskGIT/MUSE) to robotic action synthesis.
Technical soundness: clearly defined VQ-VAE tokenizer, transformer conditioning, and confidence-guided refinement loop.
Empirical rigor: evaluated on 150+ tasks across difficulty levels; includes robustness tests (dynamic, missing-observation, non-Markovian).
Fair comparison: benchmarks against continuous-action (diffusion/flow) and discrete-token baselines under identical encoders and
Limited analysis of tokenizer sensitivity: performance may depend on the VQ-VAE codebook design, but this is not explored.
Hyperparameter transparency: the exact confidence-masking threshold and its effect on refinement stability are not analyzed.
Potential complexity: the two-stage training (tokenizer + policy) increases implementation effort; joint end-to-end training would strengthen the approach.
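To make the tokenizer-sensitivity concern concrete, here is a minimal sketch of the nearest-codebook lookup at the heart of vector quantization. It assumes a fixed, already-trained codebook and omits the encoder and decoder of a full VQ-VAE; the function name `quantize` and the toy values are hypothetical and do not come from the paper.

```python
import numpy as np


def quantize(actions, codebook):
    """Map each continuous action to the id of its nearest codebook entry.

    actions:  (T, D) array of continuous action vectors
    codebook: (K, D) array of code vectors (assumed already trained)
    Returns the (T,) token ids and the (T, D) quantized actions.
    """
    # Squared Euclidean distance between every action and every code.
    d2 = ((actions[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]


codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
actions = np.array([[0.1, -0.1], [0.9, 1.2]])
tokens, recon = quantize(actions, codebook)
# Reconstruction error depends directly on how well the codebook covers
# the action space, which is the sensitivity the review points to.
error = np.abs(actions - recon).max()
```

With only two codes, any action far from both entries is reconstructed poorly, so the codebook size and layout bound how precisely a downstream policy can act.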
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis
