SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu, Wei Zhan

TL;DR
SPACeR introduces a scalable self-play framework for autonomous vehicle simulation that uses a pretrained tokenized model as a reference, achieving human-like behavior with faster inference and smaller models.
Contribution
The paper presents SPACeR, a novel method combining self-play RL with a pretrained tokenized reference model to improve scalability and realism in autonomous vehicle simulation.
Findings
Achieves performance competitive with imitation learning methods.
Up to 10x faster inference than large generative models.
50x smaller in parameter size compared to existing models.
Abstract
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized…
Peer Reviews
Decision: ICLR 2026 Poster
S1. SPACeR introduces an elegant integration of self-play reinforcement learning with a pretrained tokenized reference model, using KL-divergence alignment and likelihood-based rewards to anchor decentralized policies toward human-like behaviors. The approach is conceptually clean, easy to implement, and avoids heavy reliance on heuristic reward shaping or large generative models.
S2. The proposed lightweight decentralized policy (≈65k parameters) achieves over 10× faster inference and 50× smal
W1. The effectiveness of SPACeR heavily relies on the pretrained tokenized reference model. If the reference distribution is biased or limited in coverage, the learned self-play policies may inherit those biases and fail to generalize to unseen or long-tail behaviors. The paper would benefit from a sensitivity or ablation analysis on different reference model qualities.
W2. The current experiments are restricted to vehicle agents, without incorporating pedestrians, cyclists, or mixed-traffic in
- Clearly identifies and addresses the gap between imitation learning (realistic but non-reactive) and self-play RL (reactive but unrealistic).
- Introduces a principled anchoring mechanism through likelihood and KL-based regularization.
- Demonstrates clear ablation and comparative analysis against strong baselines.
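The anchoring idea named in the reviews (a likelihood-based reward plus a KL penalty toward the reference model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the function names, the weights `alpha` and `beta`, the KL direction KL(policy || reference), and the choice of a categorical distribution over motion tokens are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into a categorical distribution over tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same tokens."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def anchored_terms(policy_logits, reference_logits, action, task_reward,
                   alpha=0.1, beta=0.1):
    """Return (shaped_reward, kl_penalty) for a single agent step.

    alpha and beta are hypothetical weights; the KL direction is assumed.
    """
    pi = softmax(policy_logits)      # decentralized self-play policy
    ref = softmax(reference_logits)  # pretrained tokenized reference model
    # Likelihood-based reward: how plausible the reference finds the action.
    log_lik = math.log(ref[action] + 1e-12)
    shaped_reward = task_reward + alpha * log_lik
    # KL penalty: pulls the policy's token distribution toward the reference.
    kl_penalty = beta * kl_divergence(pi, ref)
    return shaped_reward, kl_penalty
```

In an RL loop, the shaped reward would presumably feed the return estimate while the KL penalty is added to the policy loss. Larger hypothetical weights would pull the policy closer to the reference distribution at the cost of exploration, which is the trade-off the reviewers ask the authors to elaborate on.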
- The paper could elaborate on how anchoring parameters affect the trade-off between realism and exploration.
- The reference model’s dependence on large imitation datasets may limit generalization to low-data domains.
- The related-work section would benefit from citing related studies: A. Kuefler et al., “Imitating Driver Behavior with Generative Adversarial Networks,” IEEE IV 2017. R. P. Bhattacharyya et al., “Modeling Human Driving Behavior through Generative Adversarial Imitation Learni
* This paper tackles an important practical problem: creating sim agents that are realistic and human-like, yet fast and cheap to run.
* The SPACeR results are compelling: the small model performs well on the WOSAC composite metric and clearly outperforms the baselines in collisions and offroad events.
* The ablation results clearly show the importance of the KL divergence term.
* The authors provide good examples of WOSAC weaknesses.
* VRUs are not controlled and follow logs, which is a significant shortcoming.
* Unclear benefits of the realism reward signal in the ablation. Also see my question below.
* While the RL agent inference is fast, training is expensive (first need to train a reference policy, then run inference on it during RL).
* The HR-PPO baseline is decentralized, which makes it weaker since SPACeR can access a centralized reference model during training. A fairer comparison could be a centralized HR-PPO polic
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Reinforcement Learning in Robotics · Traffic control and management
