Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With Eligibility Trace Under Reward, Policy, and Advantage Feedback

Ishaan Shah; David Halpern; Kavosh Asadi; Michael L. Littman

arXiv:2109.07054·cs.LG·September 16, 2021

TL;DR

This paper analyzes the COACH algorithm's convergence in human-in-the-loop reinforcement learning under various feedback types and proposes E-COACH, a variant with proven convergence, comparing it with Q-learning and TAMER.

Contribution

It introduces E-COACH, a convergent variant of COACH, and provides theoretical analysis under multiple feedback schemes in human-in-the-loop RL.

Findings

01. E-COACH converges for reward, policy, and advantage feedback.

02. COACH can behave sub-optimally under certain feedback types.

03. E-COACH outperforms Q-learning and TAMER in convergence properties.

Abstract

Fluid human-agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before they have significant experience working together. Therefore, it is important that learning agents respond well to the various feedback schemes human trainers are likely to provide. This work analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three different types of feedback: policy feedback, reward feedback, and advantage feedback. For these three feedback types, we find that COACH can behave sub-optimally. We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges for all three types. We compare our COACH variant with two other reinforcement-learning algorithms: Q-learning and TAMER.
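
To make the idea of a policy-gradient update with an eligibility trace more concrete, here is a minimal Python sketch. It is not the paper's E-COACH pseudocode. It assumes a tabular softmax policy, a scalar human feedback value per step (hypothetically encoded as +1, -1, or 0), and a single update applied at the end of an episode. The names `episodic_update`, `alpha`, and `lam` are illustrative choices, not taken from the paper.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_policy(theta, state, action):
    """Gradient of log pi(action | state) for a tabular softmax policy.

    Only the row for `state` is non-zero; its entries are
    indicator(a == action) - pi(a | state).
    """
    probs = softmax(theta[state])
    grad = [[0.0] * len(row) for row in theta]
    for a, p in enumerate(probs):
        grad[state][a] = (1.0 if a == action else 0.0) - p
    return grad


def episodic_update(theta, episode, alpha=0.1, lam=0.9):
    """Return new parameters after one end-of-episode update.

    `episode` is a list of (state, action, feedback) triples, where
    `feedback` is the trainer's scalar signal for that step (assumed
    encoding). `lam` decays the eligibility trace; `alpha` is the step size.
    The input `theta` is left unmodified.
    """
    trace = [[0.0] * len(row) for row in theta]
    delta = [[0.0] * len(row) for row in theta]
    for state, action, feedback in episode:
        g = grad_log_policy(theta, state, action)
        for s in range(len(theta)):
            for a in range(len(theta[s])):
                # Decay the trace, then add the current log-policy gradient.
                trace[s][a] = lam * trace[s][a] + g[s][a]
                # Feedback weights the trace, standing in for an advantage.
                delta[s][a] += feedback * trace[s][a]
    return [
        [theta[s][a] + alpha * delta[s][a] for a in range(len(theta[s]))]
        for s in range(len(theta))
    ]
```

For example, calling `episodic_update([[0.0, 0.0]], [(0, 1, 1.0)] * 5)` on a one-state, two-action problem shifts preference toward action 1, since every step received positive feedback for that action.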

Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research

Methods: Q-Learning