DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction
Aviral Kumar, Abhishek Gupta, Sergey Levine

TL;DR
DisCor is a distribution correction method for reinforcement learning. It re-weights the training data to counter distribution mismatch and harmful feedback loops between data collection and Q-function training, improving stability and performance.
Contribution
The paper proposes DisCor, a novel algorithm that approximates an optimal data distribution correction to enhance RL training stability and effectiveness.
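The summary above says only that training data is re-weighted; it does not give the weighting rule. As a rough sketch of what such a correction can look like, the code below assumes each transition carries an estimate of how much error has accumulated in its bootstrapped target, and that weights decay exponentially with that estimate. The exponential form, the names `correction_weights` and `weighted_bellman_loss`, and the `temperature` parameter are assumptions made for illustration and are not taken from this page.

```python
import math


def correction_weights(next_errors, gamma=0.99, temperature=1.0):
    """Normalized weights proportional to exp(-gamma * error / temperature).

    next_errors: hypothetical per-transition estimates of the error in the
    bootstrapped target (larger means a less trustworthy target).
    """
    logits = [-gamma * e / temperature for e in next_errors]
    shift = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(l - shift) for l in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_bellman_loss(q_values, targets, weights):
    """Weighted squared Bellman error over a batch of transitions."""
    return sum(w * (q - t) ** 2 for w, q, t in zip(weights, q_values, targets))


# Transitions whose targets are estimated to be more erroneous get less weight.
weights = correction_weights([0.0, 1.0, 5.0])
loss = weighted_bellman_loss([1.0, 2.0, 3.0], [0.5, 2.5, 9.0], weights)
```

In this sketch, transitions with reliable targets dominate the loss, so the Q-function is fit first where its regression targets can be trusted. How the error estimates themselves are obtained is left out here.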
Findings
DisCor significantly improves learning in noisy and sparse reward environments.
Theoretical analysis shows distribution correction reduces instability in Q-learning.
Empirical results demonstrate better convergence and performance across multiple tasks.
Abstract
Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. When using standard supervised methods (e.g., for bandits), on-policy data collection provides "hard negatives" that correct the model in precisely those states and actions that the policy is likely to visit. We call this phenomenon "corrective feedback." We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from this corrective feedback, and training on the experience collected by the algorithm is not sufficient to correct errors in the Q-function. In fact, Q-learning and related methods can exhibit pathological interactions between the distribution of experience collected by the agent and the policy induced by training on that experience, leading to…
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Adaptive Dynamic Programming Control
Methods: Q-Learning
