Behavior-Consistent Deep Reinforcement Learning

Marcel Hussing; Liv G. d'Aliberti; Claas Voelcker; Benjamin Eysenbach; Eric Eaton

arXiv:2605.21214·cs.LG·May 22, 2026

TL;DR

This paper introduces Q-value Expectile Disagreement (QED), a method for behavior-consistent deep reinforcement learning that reduces cross-run policy divergence by two orders of magnitude while maintaining high performance.

Contribution

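The paper formalizes behavior-consistent RL, proves that for Boltzmann policies a temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies, and proposes QED, a temperature schedule based on double-critic disagreement that reduces cross-run divergence.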

Findings

01

QED reduces cross-run divergence by two orders of magnitude.

02

QED maintains high performance across 18 continuous-control tasks.

03

QED achieves this with modest sample-efficiency costs.

Abstract

Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naively increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these…
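The temperature claim in the abstract can be checked numerically on a toy problem. The sketch below is not the paper's QED algorithm or its exact theorem: it assumes a single state with four discrete actions, made-up Q-values for two runs, and a hypothetical proportionality constant `c`. It relies on a standard softmax fact: if two Boltzmann policies share a temperature $\tau$, their log-probabilities differ by at most $2\lVert Q_1 - Q_2\rVert_\infty / \tau$, so the KL divergence is bounded by the same quantity. Setting $\tau = c \cdot \lVert Q_1 - Q_2\rVert_\infty$ then caps the KL at $2/c$ regardless of how far apart the Q-values are.

```python
import math

def boltzmann(q_values, temperature):
    """Softmax policy pi(a) proportional to exp(Q(a) / temperature)."""
    scaled = [q / temperature for q in q_values]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def disagreement(q_a, q_b):
    """Sup-norm gap between two Q-value vectors."""
    return max(abs(a - b) for a, b in zip(q_a, q_b))

# Hypothetical Q-value estimates from two training runs in the same state.
q_run1 = [1.0, 2.5, 0.3, -0.7]
q_run2 = [1.4, 1.9, 0.8, -0.2]

gap = disagreement(q_run1, q_run2)
c = 4.0                  # hypothetical proportionality constant (assumption)
temperature = c * gap    # temperature proportional to Q-function disagreement

policy1 = boltzmann(q_run1, temperature)
policy2 = boltzmann(q_run2, temperature)

kl = kl_divergence(policy1, policy2)
bound = 2.0 * gap / temperature  # simplifies to 2 / c

print(f"KL(pi1 || pi2) = {kl:.4f}, bound = {bound:.4f}")
```

A larger `c` tightens the bound but pushes both policies toward uniform, which is the trade-off the abstract flags when it warns that naively increasing entropy can impair policy optimization.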
