Behavior-Consistent Deep Reinforcement Learning
Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton

TL;DR
This paper introduces Q-value Expectile Disagreement (QED), a novel method for behavior-consistent deep reinforcement learning that significantly reduces policy divergence across training runs while maintaining high performance.
Contribution
The paper formalizes behavior-consistent RL, proves bounds for Boltzmann policies, and proposes QED, a new temperature schedule based on double-critic disagreement to reduce cross-run divergence.
Findings
QED reduces cross-run divergence by two orders of magnitude.
QED maintains high performance across 18 continuous-control tasks.
QED achieves this with modest sample-efficiency costs.
Abstract
Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to Q-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naively increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these…
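The temperature argument in the abstract can be illustrated with a small numerical sketch. This is not the paper's QED algorithm or its theorem. It assumes a single state with a few discrete actions and relies on an elementary bound: if two Q-vectors differ by at most delta in sup norm, their Boltzmann policies at temperature tau satisfy KL <= 2 * delta / tau, so setting tau = c * delta caps the KL at 2 / c. The function names, the constant c, and the random Q-values are hypothetical and chosen only for illustration.

```python
import numpy as np

def boltzmann_log_probs(q, tau):
    """Log-probabilities of the Boltzmann policy pi(a) proportional to exp(Q(a) / tau)."""
    z = q / tau
    z = z - z.max()                      # shift for numerical stability
    return z - np.log(np.exp(z).sum())   # log-softmax

def kl_divergence(log_p, log_q):
    """KL(p || q) for two discrete distributions given as log-probabilities."""
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

def disagreement_temperature(q1, q2, c=4.0, floor=1e-8):
    """Hypothetical schedule: temperature proportional to sup-norm Q disagreement."""
    delta = float(np.max(np.abs(q1 - q2)))
    return max(c * delta, floor)         # floor avoids tau = 0 when the runs agree

# Two "training runs" that learned slightly different Q-values (made-up numbers).
rng = np.random.default_rng(0)
q_run_a = rng.normal(size=5)
q_run_b = q_run_a + rng.normal(scale=0.5, size=5)

c = 4.0
tau = disagreement_temperature(q_run_a, q_run_b, c=c)
log_pi_a = boltzmann_log_probs(q_run_a, tau)
log_pi_b = boltzmann_log_probs(q_run_b, tau)

divergence = kl_divergence(log_pi_a, log_pi_b)
bound = 2.0 / c                          # elementary bound, assumed here for illustration
print(f"tau={tau:.3f}  KL={divergence:.4f}  bound={bound:.4f}")
```

A larger c tightens the bound but pushes both policies toward uniform, which mirrors the trade-off the abstract raises: more entropy buys consistency at a possible cost to policy optimization.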
