Learning to Search Better Than Your Teacher
Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, John Langford

TL;DR
The paper introduces LOLS, a novel learning to search algorithm that not only performs well relative to a reference policy but also guarantees low regret against deviations from the learned policy, enabling improvements over suboptimal references.
Contribution
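LOLS is a new learning to search method with a local-optimality guarantee: low regret compared to deviations from the learned policy. This lets it improve upon a poor reference policy and makes it possible to develop structured contextual bandits, a partial information structured prediction setting.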
Findings
LOLS guarantees low regret compared to deviations from the learned policy.
LOLS can improve upon suboptimal reference policies.
LOLS enables structured contextual bandits, a partial information structured prediction setting.
Abstract
Methods for learning to search for structured prediction typically imitate a reference policy, with existing theoretical guarantees demonstrating low regret compared to that reference. This is unsatisfactory in many applications where the reference policy is suboptimal and the goal of learning is to improve upon it. Can learning to search work even when the reference is poor? We provide a new learning to search algorithm, LOLS, which does well relative to the reference policy, but additionally guarantees low regret compared to deviations from the learned policy: a local-optimality guarantee. Consequently, LOLS can improve upon the reference policy, unlike previous algorithms. This enables us to develop structured contextual bandits, a partial information structured prediction setting with many potential applications.
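The abstract does not spell out the procedure, so the following is only a minimal sketch of what "deviations from the learned policy" could look like on a toy sequence-labeling task. It assumes a roll-in with the learned policy, a single changed action at one position, and a roll-out with the reference policy with probability `beta` (otherwise the learned policy). The Hamming cost, the toy policies, and every name here (`rollout_cost`, `one_step_deviation_costs`, `beta`) are hypothetical illustrations, not the authors' implementation.

```python
import random

def rollout_cost(prefix, policy, truth):
    """Complete `prefix` with `policy` and return the Hamming cost against `truth`."""
    seq = list(prefix)
    while len(seq) < len(truth):
        seq.append(policy(seq, len(seq)))
    return sum(a != b for a, b in zip(seq, truth))

def one_step_deviation_costs(learned, reference, truth, actions, t, beta, rng):
    """Cost of each one-step deviation at position t (illustrative assumption)."""
    # Roll in with the learned policy up to (but not including) position t.
    prefix = []
    for i in range(t):
        prefix.append(learned(prefix, i))
    # Choose the roll-out policy once: reference with probability beta, else learned.
    out_policy = reference if rng.random() < beta else learned
    # Try every action at position t and complete the sequence with the roll-out policy.
    return {a: rollout_cost(prefix + [a], out_policy, truth) for a in actions}

# Toy setup: the reference is deliberately poor (it always predicts 0).
truth = [1, 0, 1, 1]
actions = [0, 1]
learned = lambda prefix, i: 1      # hypothetical learned policy: always predicts 1
reference = lambda prefix, i: 0    # hypothetical poor reference: always predicts 0

costs = one_step_deviation_costs(
    learned, reference, truth, actions, t=1, beta=0.5, rng=random.Random(0)
)
best_action = min(costs, key=costs.get)
```

In this sketch the per-action costs would serve as a cost-sensitive training example for position `t`. Rolling out with the learned policy part of the time is what lets the cost estimates reflect deviations from the learned policy and not only from the reference.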
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics
