Detecting Rewards Deterioration in Episodic Reinforcement Learning
Ido Greenberg, Shie Mannor

TL;DR
This paper introduces a statistical test for detecting reward deterioration in episodic reinforcement learning. The test requires no environment model, can be applied online, and outperforms standard methods in a range of control scenarios.
Contribution
The paper proposes a multivariate mean-shift detection method tailored to episodic RL rewards, together with a bootstrap-based mechanism that controls the false-alarm rate when the test is applied online.
Findings
The test outperforms standard methods by orders of magnitude in detecting reward deterioration.
The method applies to any episodic signal and does not rely on an environment model.
It is effective for online detection of performance drifts in RL agents.
Abstract
In many RL applications, once training ends, it is vital to detect any deterioration in the agent performance as soon as possible. Furthermore, it often has to be done without modifying the policy and under minimal assumptions regarding the environment. In this paper, we address this problem by focusing directly on the rewards and testing for degradation. We consider an episodic framework, where the rewards within each episode are not independent, nor identically-distributed, nor Markov. We present this problem as a multivariate mean-shift detection problem with possibly partial observations. We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power. Empirically, on deteriorated rewards in control problems (generated using various environment modifications), the test is…
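As a rough illustration of the idea described above, the sketch below tests a batch of episodes for a drop in mean reward using a covariance-weighted mean, with a rejection threshold calibrated by bootstrapping reference episodes. It is not the authors' implementation. The function names, the ridge term, the weights proportional to Σ⁻¹·1 (the classical choice for a uniform mean shift under a Gaussian model), and the fixed-size batch setting are all assumptions made for this example.

```python
import numpy as np


def fit_reference(rewards):
    """Estimate per-time-step mean and covariance from reference episodes.

    rewards: array of shape (n_episodes, T) with T >= 2, one row per episode.
    """
    mu = rewards.mean(axis=0)
    sigma = np.cov(rewards, rowvar=False)
    # Small ridge term (an assumption) so the covariance is safely invertible.
    sigma = sigma + 1e-6 * np.eye(rewards.shape[1])
    return mu, sigma


def weighted_mean_statistic(episodes, mu, sigma):
    """Average of the covariance-weighted reward deviations over episodes.

    Weights w = sigma^{-1} 1 (assumed here); strongly negative values
    suggest that rewards dropped relative to the reference.
    """
    w = np.linalg.solve(sigma, np.ones(len(mu)))
    return float(((episodes - mu) @ w).mean())


def bootstrap_threshold(reference, mu, sigma, n_test,
                        alpha=0.05, n_boot=2000, seed=0):
    """Alpha-quantile of the statistic over resampled reference batches."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample whole episodes so within-episode dependence is preserved.
        idx = rng.integers(0, len(reference), size=n_test)
        stats[b] = weighted_mean_statistic(reference[idx], mu, sigma)
    return float(np.quantile(stats, alpha))


def detect_deterioration(episodes, mu, sigma, threshold):
    """Flag deterioration when the statistic falls below the threshold."""
    return weighted_mean_statistic(episodes, mu, sigma) < threshold
```

In this sketch the threshold is calibrated for one batch size `n_test`. Resampling whole episodes rather than individual time steps means no independence is assumed between rewards within an episode, which matches the setting the abstract describes.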
Taxonomy
Topics: Data Stream Mining Techniques · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research
