TL;DR
ZTRS is a novel end-to-end autonomous driving framework that eliminates imitation learning by exclusively using reinforcement learning on raw sensor data, achieving state-of-the-art results across multiple benchmarks.
Contribution
ZTRS introduces the first framework to replace imitation learning entirely with reward-based reinforcement learning on high-dimensional sensor inputs for autonomous driving.
Findings
Achieves state-of-the-art results on the Navhard benchmark.
Outperforms imitation learning baselines on HUGSIM.
Demonstrates strong performance in real-world and synthetic scenarios.
Abstract
End-to-end autonomous driving maps raw sensor inputs directly into ego-vehicle trajectories to avoid cascading errors from perception modules and to leverage rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and covariate shift during deployment. On the other hand, Reinforcement Learning (RL) has recently shown potential in scaling up with simulations, but is typically confined to low-dimensional symbolic inputs (e.g. 3D objects and maps), falling short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: sensor inputs without losing information and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
The figures are very clear and explain the arguments and descriptions well. In general, the idea of scoring the output to obtain rewards for a Reinforcement Learning approach is a good fit for the autonomous driving domain. The approach was tested on a wide range of benchmarks and seems to be at or close to the state of the art. Useful ablation studies, e.g. on the size of the action space, make it possible to estimate the robustness of the approach.
The Exhaustive Policy Optimization appears to be strongly similar to Group Relative Policy Optimization (GRPO) but is given a different name. The authors should either frame their approach within the context of GRPO or explain in the related work how it differs. The language of the paper could be improved. The conclusion states that the approach "eliminates" imitation learning. This is not something which can be expected from any single approach in the near future nor a statement wit
1. The goal of reducing or eliminating the dependency on large-scale, high-quality expert demonstrations (IL) is a significant and well-motivated problem in autonomous driving.
2. The paper presents strong results on the challenging Navhard benchmark, outperforming prior IL-based methods. This demonstrates that the proposed RL-based approach can be effectively optimized for the given offline metrics.
1. The "trajectory scorer" paradigm is not new and has been extensively used by many prior works cited in this paper (e.g., Hydra-MDP, GTRS-Dense, DriveSuprim). The overall architecture (image backbone, trajectory tokenizer, transformer decoder) is standard for this class of models.
2. The method presented as EPO is a direct and standard application of the Policy Gradient Theorem for a discrete, enumerable action space. When the action space is small enough to enumerate, computing the full su
* End-to-end solution for autonomous driving
* Uses a differentiable autonomous driving stack without a VLM
* Uses Reinforcement Learning without IL, where the advantage is defined by EPDMS $\varepsilon$ with an optional correction term $b$
* Done by the so-called EPO, which is a version of the policy gradient for offline data and enumerable actions
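The reviewers' description of EPO as a policy gradient over an enumerable action space can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the four-candidate trajectory vocabulary, the reward values (standing in for per-trajectory EPDMS scores), and the learning rate are all hypothetical. It assumes a softmax policy over a fixed set of candidate trajectories and computes the exact gradient of the expected advantage instead of a sampled estimate.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the candidate trajectories.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def exhaustive_policy_gradient(logits, rewards, baseline=0.0):
    """Exact gradient of sum_k pi_k * (r_k - baseline) w.r.t. the logits.

    Hypothetical helper: every candidate is enumerated, so no sampling
    is needed. `rewards` stands in for per-trajectory scores.
    """
    pi = softmax(logits)
    adv = rewards - baseline
    # For a softmax policy, d/d logit_j of sum_k pi_k * adv_k
    # equals pi_j * (adv_j - sum_k pi_k * adv_k).
    return pi * (adv - np.dot(pi, adv))

# Toy vocabulary of four candidate trajectories (assumed values).
rewards = np.array([0.1, 0.9, 0.5, 0.3])
logits = np.zeros(4)

# Plain gradient ascent on the expected reward.
for _ in range(200):
    logits += 1.0 * exhaustive_policy_gradient(logits, rewards)

best = int(np.argmax(softmax(logits)))  # index of the highest-reward candidate
```

Under these assumptions, a constant correction term leaves the exact gradient unchanged, because the softmax probabilities sum to one; a baseline only affects variance when the sum is replaced by a sampled estimate.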
* No mention of the CIMRL work [1], which also tries to learn an RL-based trajectory scorer on top of any trajectory source (either IL- or RL-based); it seems like a good reference for the main idea of the paper and for Section "4.2 RL For AD", symbolic-input methods
* No rigorous ablations on rewards (what to include and what to exclude), and sampling over the sum of $m+1$ trajectories (Section 2.1)
* Unclear why the weight decay factor is set to 0.0 rather than to a small number
* The results using the
