Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

Aron Distelzweig; Faris Janjoš; Andreas Look; Anna Rothenhäusler; Daniel Jost; Oliver Scheel; Raghu Rajan; Daphne Cornelisse; Eugene Vinitsky; Joschka Boedecker

arXiv:2605.10034·cs.RO·May 12, 2026

TL;DR

BehaviorBench is a comprehensive evaluation suite for autonomous driving policies, addressing evaluation, complexity, and behavior diversity to better assess generalization and robustness of RL-trained policies.

Contribution

We introduce BehaviorBench, a new benchmark that connects RL policies to established datasets, evaluates complex interactions, and tests against diverse traffic behaviors.

Findings

01

RL policies overfit to training opponents and fail to generalize.

02

An interaction-rich split of the Waymo Open Motion Dataset (WOMD) reveals the need for multi-agent reasoning.

03

A hybrid PPO and rule-based planner improves robustness.

Abstract

Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench, a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, in a fraction of the time. Regarding Complexity, we observe that today's…
