Learning with Options that Terminate Off-Policy
Anna Harutyunyan, Peter Vrancx, Pierre-Luc Bacon, Doina Precup, Ann Nowé

TL;DR
This paper introduces Q(β), an off-policy learning algorithm for options that decouples behavior and target terminations, enabling flexible and efficient learning of policies with various termination conditions.
Contribution
It proposes Q(β), an algorithm that learns the solution with respect to any target termination condition, regardless of how the options actually terminate during behavior. This extends the off-policy treatment of policies to terminations.
Findings
Q(β) effectively learns policies with different termination conditions.
The algorithm outperforms traditional methods in flexibility and efficiency.
Empirical results validate the theoretical advantages of decoupling terminations.
Abstract
A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy exactly, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just as is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive…
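To make the idea of a target termination concrete, here is a minimal sketch. It is not the paper's Q(β) update, which this summary does not reproduce. It assumes a simple one-step, intra-option-style backup in which the bootstrapped value mixes "continue the current option" and "switch to the best option" according to a target termination probability. The function name `decoupled_backup_target` and all of its parameters are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def decoupled_backup_target(
    q_next: Sequence[float],
    option: int,
    reward: float,
    beta_target: float,
    gamma: float,
) -> float:
    """One-step backup target under an assumed target termination probability.

    q_next      -- option values at the next state (illustrative numbers)
    option      -- index of the option that was being executed
    beta_target -- probability that the *target* termination condition fires
                   at the next state; how the behavior actually terminated
                   does not appear anywhere in this computation
    """
    continue_value = q_next[option]   # value if the option keeps running
    switch_value = max(q_next)        # value if we terminate and re-select greedily
    expected_next = (1.0 - beta_target) * continue_value + beta_target * switch_value
    return reward + gamma * expected_next


# Same transition, three different target terminations.
q_next = [1.0, 3.0]
long_option = decoupled_backup_target(q_next, option=0, reward=0.0, beta_target=0.0, gamma=0.5)
mixed_option = decoupled_backup_target(q_next, option=0, reward=0.0, beta_target=0.5, gamma=0.5)
short_option = decoupled_backup_target(q_next, option=0, reward=0.0, beta_target=1.0, gamma=0.5)
```

In this toy setting, a target that never terminates backs up the value of staying with option 0, while a target that always terminates backs up the greedy value over options. The behavior's own termination plays no role in the target, which is the decoupling the abstract describes.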
