TL;DR
This paper introduces a new reinforcement learning setting where feedback is limited to trajectory scores rather than individual rewards, and develops algorithms with regret analysis for this weaker feedback model.
Contribution
It extends RL algorithms to the trajectory feedback setting, including unknown transition models, and provides regret bounds for these algorithms.
Findings
The proposed algorithms achieve sublinear regret under trajectory feedback.
A hybrid optimistic-Thompson Sampling approach is tractable when the transition model is unknown.
The work broadens RL applicability to settings with limited feedback.
Abstract
The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. However, in practice, it is often the case that such frequent feedback is not available. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback. Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For cases where the transition model is unknown, we…
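To make the least-squares idea in the abstract concrete, here is a minimal sketch, not the paper's algorithm. It assumes each trajectory is summarized by a vector of visit counts over state-action pairs, so that the trajectory score is a noisy inner product between those counts and the unknown reward vector. All names (`visitation_counts`, `estimate_rewards`, `reg`) are hypothetical, the ridge regularizer is an added assumption, and the trajectories are drawn uniformly at random rather than from a policy acting in an MDP.

```python
import numpy as np


def visitation_counts(trajectory, n_pairs):
    """Count how often each state-action index appears in one trajectory."""
    return np.bincount(trajectory, minlength=n_pairs).astype(float)


def estimate_rewards(trajectories, scores, n_pairs, reg=1.0):
    """Ridge least-squares estimate of per-pair rewards from trajectory scores.

    Solves (X^T X + reg * I) r = X^T y, where row k of X holds the visit
    counts of trajectory k and y[k] is its observed score.
    """
    X = np.stack([visitation_counts(t, n_pairs) for t in trajectories])
    y = np.asarray(scores, dtype=float)
    gram = X.T @ X + reg * np.eye(n_pairs)
    return np.linalg.solve(gram, X.T @ y)


# Synthetic data (assumption): 6 state-action pairs, horizon 4,
# uniformly random trajectories, Gaussian noise on each score.
rng = np.random.default_rng(0)
n_pairs, horizon, n_episodes = 6, 4, 2000
true_rewards = rng.uniform(0.0, 1.0, size=n_pairs)
trajectories = rng.integers(0, n_pairs, size=(n_episodes, horizon))

# Trajectory feedback: only the noisy sum of rewards is observed.
scores = [true_rewards[t].sum() + rng.normal(0.0, 0.1) for t in trajectories]

estimated = estimate_rewards(trajectories, scores, n_pairs)
max_error = float(np.max(np.abs(estimated - true_rewards)))
print("max abs estimation error:", max_error)
```

The estimator never sees a per-step reward, only the per-trajectory sums. It recovers individual rewards because different trajectories visit the state-action pairs in different proportions.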