Multi-armed Bandit for Stochastic Shortest Path in Mixed Autonomy
Yu Bai, Yiming Li, Xi Xiong

TL;DR
This paper introduces a novel RTDP-based algorithm incorporating UCB exploration for mixed-autonomy traffic routing, effectively balancing exploration and exploitation to find optimal strategies in stochastic environments.
Contribution
It develops a new RTDP-based multi-armed bandit algorithm with UCB exploration for stochastic routing in mixed-autonomy traffic networks, backed by a path-level regret upper bound and improved computational efficiency over Value Iteration.
Findings
The algorithm guarantees worst-case convergence to optimal policies.
It outperforms standard RTDP in highly stochastic environments.
It demonstrates superior computational efficiency over Value Iteration.
Abstract
In mixed-autonomy traffic networks, autonomous vehicles (AVs) are required to make sequential routing decisions under uncertainty caused by dynamic and heterogeneous interactions with human-driven vehicles (HDVs). Early-stage greedy decisions made by AVs during interactions with the environment often result in insufficient exploration, leading to failures in discovering globally optimal strategies. The exploration-exploitation balancing mechanism inherent in multi-armed bandit (MAB) methods is well-suited for addressing such problems. Based on the Real-Time Dynamic Programming (RTDP) framework, we introduce the Upper Confidence Bound (UCB) exploration strategy from the MAB paradigm and propose a novel algorithm. We establish the path-level regret upper bound under the RTDP framework, which guarantees the worst-case convergence of the proposed algorithm. Extensive numerical experiments…
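To make the idea in the abstract concrete, the following is a minimal sketch of trial-based value backups (in the spirit of RTDP) combined with a UCB-style exploration bonus. It is not the authors' algorithm. The toy network `GRAPH`, the uniform travel-time noise, the deterministic edge transitions, the bonus weight `beta`, and the function name `ucb_rtdp` are all illustrative assumptions. Because routing minimizes cost, the bonus is subtracted, so the index acts as a lower confidence bound on the cost-to-go.

```python
import math
import random

# Hypothetical toy network: node -> list of (next_node, mean_travel_time).
# The edge 0 -> 1 looks cheap but leads to an expensive edge, so a purely
# greedy early choice can miss the better route 0 -> 2 -> 3.
GRAPH = {
    0: [(1, 1.0), (2, 2.0)],
    1: [(3, 5.0)],
    2: [(3, 1.0)],
    3: [],  # goal node
}
GOAL = 3


def sample_cost(mean, rng):
    # Assumed noise model: travel time uniform in [0.5 * mean, 1.5 * mean].
    return rng.uniform(0.5 * mean, 1.5 * mean)


def ucb_rtdp(graph, goal, start=0, episodes=2000, beta=1.0, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in graph}  # optimistic zero cost-to-go estimates
    n_sa = {(s, a): 0 for s in graph for a in range(len(graph[s]))}
    mean_cost = {key: 0.0 for key in n_sa}

    for _ in range(episodes):
        s = start
        while s != goal:
            actions = range(len(graph[s]))
            n_s = sum(n_sa[(s, a)] for a in actions)

            # Index = estimated cost-to-go minus an exploration bonus.
            # Untried edges get -inf so each is sampled at least once.
            indices = []
            for a in actions:
                n = n_sa[(s, a)]
                if n == 0:
                    indices.append(-math.inf)
                else:
                    q = mean_cost[(s, a)] + V[graph[s][a][0]]
                    indices.append(q - beta * math.sqrt(math.log(n_s + 1) / n))
            a = min(actions, key=lambda i: indices[i])

            # Traverse the edge and update its empirical mean cost.
            nxt, mu = graph[s][a]
            c = sample_cost(mu, rng)
            n_sa[(s, a)] += 1
            mean_cost[(s, a)] += (c - mean_cost[(s, a)]) / n_sa[(s, a)]

            # RTDP-style backup at the visited node, over edges tried so far.
            V[s] = min(
                mean_cost[(s, b)] + V[graph[s][b][0]]
                for b in actions
                if n_sa[(s, b)] > 0
            )
            s = nxt
    return V, n_sa


if __name__ == "__main__":
    V, n_sa = ucb_rtdp(GRAPH, GOAL)
    print("estimated cost-to-go from node 0:", round(V[0], 2))
    print("visits to edge 0->1:", n_sa[(0, 0)], "| edge 0->2:", n_sa[(0, 1)])
```

In this toy setting the expected cost of the route through node 2 is 3.0 versus 6.0 through node 1, so the estimate at node 0 should settle near 3.0 while the misleading first edge is sampled only a handful of times.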
Taxonomy
Topics: Age of Information Optimization · Transportation and Mobility Innovations · Reinforcement Learning in Robotics
