Reinforcement Learning from Adversarial Preferences in Tabular MDPs
Taira Tsuchiya, Shinji Ito, Haipeng Luo

TL;DR
This paper introduces a new framework for reinforcement learning in tabular MDPs with adversarial preferences, focusing on Borda scores, establishing regret lower bounds, and proposing algorithms with near-optimal regret bounds.
Contribution
It formulates preference-based MDPs with adversarial preferences, derives regret lower bounds, and develops algorithms achieving near-optimal regret in this setting.
Findings
Established regret lower bounds for preference-based MDPs.
Developed algorithms with regret bounds of order T^{2/3}.
Extended algorithms to unknown transition settings.
Abstract
We introduce a new framework of episodic tabular Markov decision processes (MDPs) with adversarial preferences, which we refer to as preference-based MDPs (PbMDPs). Unlike standard episodic MDPs with adversarial losses, where the numerical value of the loss is directly observed, in PbMDPs the learner instead observes preferences between two candidate arms, which represent the choices being compared. In this work, we focus specifically on the setting where the reward functions are determined by Borda scores. We begin by establishing a regret lower bound for PbMDPs with Borda scores. As a preliminary step, we present a simple instance to prove a lower bound of \Omega(\sqrt{HSAT}) for episodic MDPs with adversarial losses, where H is the number of steps per episode, S is the number of states, A is the number of actions, and T is the number of episodes. Leveraging this…
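To make the notion of a Borda score concrete, here is a minimal Python sketch. It assumes the definition commonly used in the dueling-bandit literature, where the Borda score of an arm is its average probability of being preferred over each of the other arms; the paper's exact normalization may differ. The preference matrix `PREF` and the helper names `borda_scores` and `observe_preference` are hypothetical and chosen only for illustration.

```python
import random

def borda_scores(pref):
    """Borda score of each arm under the assumed (common) definition.

    pref[i][j] is the probability that arm i is preferred over arm j.
    The score of arm i is the average of pref[i][j] over all j != i.
    """
    k = len(pref)
    return [
        sum(pref[i][j] for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]

def observe_preference(pref, i, j, rng):
    """Simulate preference feedback: 1 if arm i beats arm j, else 0.

    The learner sees only this binary outcome, never a numerical loss.
    """
    return 1 if rng.random() < pref[i][j] else 0

# Hypothetical 3-arm preference matrix (illustrative values only).
PREF = [
    [0.5, 0.6, 0.9],
    [0.4, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]

SCORES = borda_scores(PREF)                       # one score per arm
FEEDBACK = observe_preference(PREF, 0, 2, random.Random(0))  # binary outcome
```

In this sketch the learner would only ever see values like `FEEDBACK`, while regret is measured against the underlying `SCORES`. In the adversarial setting described above, the preference matrix could additionally change from episode to episode.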
