Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling
Xiaolong Tang, Meina Kan, Shiguang Shan, Xilin Chen

TL;DR
Plan-R1 introduces a two-stage trajectory planning framework for autonomous driving that combines human-like behavior modeling with explicit safety and rule adherence, using a novel variance-decoupled optimization method.
Contribution
It proposes a decoupled two-stage learning approach and a new Variance-Decoupled GRPO (VD-GRPO) to improve the safety and feasibility of planned trajectories in autonomous driving.
Findings
Achieves state-of-the-art safety and feasibility on the nuPlan benchmark.
Effectively balances human-like behavior with safety constraints.
Outperforms existing methods in reactive driving scenarios.
Abstract
Safe and feasible trajectory planning is critical for real-world autonomous driving systems. However, existing learning-based planners rely heavily on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting undesirable behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. In the first stage, a general trajectory predictor is pre-trained on expert data to capture diverse, human-like driving behaviors. In the second stage, the model is fine-tuned with rule-based rewards using Group Relative Policy Optimization (GRPO), explicitly aligning ego planning with principles such as safety, comfort, and traffic rule compliance. This two-stage paradigm retains human-like behaviors…
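The second-stage signal described in the abstract can be sketched as follows. This is a minimal illustration of standard GRPO's group-relative advantage, not the paper's implementation: the `rule_reward` function, its three boolean checks, and the equal weights are hypothetical stand-ins for the safety, comfort, and rule-compliance rewards, and the use of the population standard deviation is an assumption.

```python
import statistics

def rule_reward(collided, speeding, comfortable):
    # Hypothetical rule-based score (not from the paper): a collision
    # zeroes the reward; otherwise rule compliance and comfort each
    # contribute half.
    if collided:
        return 0.0
    return 0.5 * (not speeding) + 0.5 * comfortable

def grpo_advantages(rewards, eps=1e-6):
    # Standard GRPO: center each reward on the group mean and divide
    # by the group standard deviation (population std assumed here).
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of sampled ego trajectories for the same scene, described
# by hypothetical (collided, speeding, comfortable) checks.
rollouts = [
    (False, False, True),   # safe, legal, comfortable
    (False, True, True),    # speeding
    (True, False, True),    # collision
    (False, False, False),  # uncomfortable
]
rewards = [rule_reward(*r) for r in rollouts]
adv = grpo_advantages(rewards)

for r, a in zip(rewards, adv):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Trajectories scoring above the group mean receive positive advantages and are reinforced, while those below it are discouraged, so no learned value function is needed.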
Peer Reviews
Decision: ICLR 2026 Poster
• Clear two-stage framework: Plan-R1 decouples behavior learning from principle alignment, retaining human-like behavior while enhancing safety awareness and removing undesirable patterns present in expert data (p.1, lines 054–058).
• Novel VD-GRPO: In response to standard GRPO’s limitations, VD-GRPO replaces in-group normalization with centering and fixed scaling, effectively preventing rare but critical safety-violation signals from being washed out and ensuring safety-critical objectives do…
• Definition of pivots: The paper states that trajectories are discretized into motion tokens but does not delve into how these “pivot” points are chosen or defined, nor whether such discretization might miss key kinematic or geometric features (p.5, lines 217–223).
• Detailed analysis of VD-GRPO: Although VD-GRPO is proposed, there is limited theoretical analysis of how its parameters (e.g., the fixed scaling constant c) affect training dynamics; the discussion is mostly empirical (p.9, lines…
The paper introduces a novel and well-motivated idea of framing trajectory planning as language modeling through the two-stage Plan-R1 framework. The decoupling of behavior learning and principle alignment is conceptually elegant and practically effective. The proposed VD-GRPO clearly addresses a key limitation in standard GRPO, preserving safety-critical gradients and improving rare-event optimization. Experiments on the nuPlan benchmark are extensive and convincing, with strong gains in both n…
While the empirical results are strong, the theoretical justification of VD-GRPO remains limited. The fixed scaling constant is treated as a hyperparameter without principled analysis of its effect on convergence or stability. The dual-model setting, with a frozen world model, may introduce distribution drift in long-term interactions. Moreover, evaluation is restricted to nuPlan; results on other benchmarks such as CARLA or Waymo would strengthen claims of generalization.
1. The authors clearly diagnose how per-group variance normalization in GRPO suppresses rare but safety-critical violations, and propose a principled solution that consistently enhances safety optimization without sacrificing secondary objectives.
2. Using a frozen world model for surrounding-agent responses ensures interaction-aware rollouts while preventing instability in non-ego behaviors. This design proves crucial for strong performance in reactive scenarios.
3. Plan-R1 achieves new state…
1. Limited analysis of world model reliability when ego deviates from expert behavior. The frozen world model is assumed to remain accurate when the ego policy explores beyond regions well-covered by expert data. However, there is no case study or quantitative analysis showing how the world model behaves under large ego deviations or unusual interaction patterns, which could introduce compounding errors during RL fine-tuning.
2. Tokenization design lacks ablation on discretization choices. The…
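The reviews describe VD-GRPO as replacing in-group normalization with centering and fixed scaling. The sketch below shows one possible reading of that description and should not be taken as the paper's exact formula: dividing by the constant `c`, the value `c=1.0`, and the reward numbers are all assumptions chosen for illustration.

```python
import statistics

def std_normalized(rewards, eps=1e-6):
    # Standard GRPO-style advantage: center, then divide by the group std.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def centered_fixed_scale(rewards, c=1.0):
    # Assumed reading of VD-GRPO: center, then divide by a fixed
    # constant c instead of the per-group std.
    mean = statistics.fmean(rewards)
    return [(r - mean) / c for r in rewards]

def max_abs(values):
    return max(abs(v) for v in values)

# Hypothetical reward groups: one differs only by small comfort terms,
# the other contains a single heavily penalized collision.
comfort_only = [0.90, 0.92, 0.91, 0.91]
with_collision = [0.90, 0.92, -1.00, 0.91]

# Ratio of the largest advantage magnitude in the collision group to
# that in the comfort-only group, under each scheme.
ratio_std = max_abs(std_normalized(with_collision)) / max_abs(std_normalized(comfort_only))
ratio_fixed = max_abs(centered_fixed_scale(with_collision)) / max_abs(centered_fixed_scale(comfort_only))

print(f"std-normalized ratio: {ratio_std:.2f}")
print(f"fixed-scale ratio:    {ratio_fixed:.2f}")
```

Under per-group std normalization, both groups produce advantages of comparable magnitude, so a collision carries little more weight than a minor comfort difference. With centering and a fixed scale, the collision group's largest advantage is more than a hundred times bigger here, which matches the reviewers' point that rare safety violations are no longer washed out.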
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Natural Language Processing Techniques · Model-Driven Software Engineering Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN
