Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With Eligibility Trace Under Reward, Policy, and Advantage Feedback
Ishaan Shah, David Halpern, Kavosh Asadi, Michael L. Littman

TL;DR
This paper analyzes the COACH algorithm's convergence in human-in-the-loop reinforcement learning under various feedback types and proposes E-COACH, a variant with proven convergence, comparing it with Q-learning and TAMER.
Contribution
It introduces E-COACH, a convergent variant of COACH, and provides theoretical analysis under multiple feedback schemes in human-in-the-loop RL.
Findings
E-COACH converges for reward, policy, and advantage feedback.
COACH can behave sub-optimally under certain feedback types.
E-COACH is compared with Q-learning and TAMER with respect to convergence under these feedback types.
Abstract
Fluid human-agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before they have significant experience working together. Therefore, it is important that learning agents respond well to the various feedback schemes human trainers are likely to provide. This work analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three different types of feedback: policy feedback, reward feedback, and advantage feedback. For these three feedback types, we find that COACH can behave sub-optimally. We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges for all three types. We compare our COACH variant with two other reinforcement-learning algorithms: Q-learning and TAMER.
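To make the three feedback types concrete, the following is a minimal sketch of a trace-based policy-gradient step driven by a scalar human feedback signal. It is not the paper's E-COACH algorithm, whose episodic details are not given in this summary. The tabular softmax policy, the function names, the step size `alpha`, the trace decay `lam`, and the specific encodings of policy feedback (+1/-1) and advantage feedback (trainer's value of the action minus the best value in that state) are all assumptions made for illustration.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over one state's action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(prefs, action):
    """Gradient of log pi(action | state) w.r.t. that state's preferences."""
    probs = softmax(prefs)
    return [(1.0 if a == action else 0.0) - probs[a] for a in range(len(prefs))]

def trace_step(theta, trace, state, action, feedback, alpha=0.1, lam=0.9):
    """Decay the trace, add grad log pi, then step theta by feedback * trace (in place)."""
    grad = grad_log_pi(theta[state], action)
    for s in range(len(theta)):
        for a in range(len(theta[s])):
            trace[s][a] *= lam
    for a in range(len(grad)):
        trace[state][a] += grad[a]
    for s in range(len(theta)):
        for a in range(len(theta[s])):
            theta[s][a] += alpha * feedback * trace[s][a]

# Hypothetical encodings of the three feedback types (assumptions, not from the paper).
def reward_feedback(rewards, state, action):
    """Trainer reports the task reward for the chosen action."""
    return rewards[state][action]

def policy_feedback(q_values, state, action):
    """Trainer reports +1 if the action is one they consider best, else -1."""
    return 1.0 if q_values[state][action] == max(q_values[state]) else -1.0

def advantage_feedback(q_values, state, action):
    """Trainer reports the action's value minus the best value in that state."""
    return q_values[state][action] - max(q_values[state])
```

Any of the three feedback functions can supply the `feedback` argument of `trace_step`. Because the trace carries over between steps, feedback given for one action also adjusts the preferences of recently taken actions.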
Taxonomy
Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research
Methods: Q-Learning
