Reinforcement Learning from Adversarial Preferences in Tabular MDPs

Taira Tsuchiya; Shinji Ito; Haipeng Luo

arXiv:2507.11706·cs.LG·July 17, 2025

TL;DR

This paper introduces a new framework for reinforcement learning in tabular MDPs with adversarial preferences, focusing on Borda scores, establishing regret lower bounds, and proposing algorithms with near-optimal regret bounds.

Contribution

It formulates preference-based MDPs with adversarial preferences, derives regret lower bounds, and develops algorithms achieving near-optimal regret in this setting.

Findings

1. Established regret lower bounds for preference-based MDPs.
2. Developed algorithms with regret bounds of order T^{2/3}.
3. Extended algorithms to unknown transition settings.

Abstract

We introduce a new framework of episodic tabular Markov decision processes (MDPs) with adversarial preferences, which we refer to as preference-based MDPs (PbMDPs). Unlike standard episodic MDPs with adversarial losses, where the numerical value of the loss is directly observed, in PbMDPs the learner instead observes preferences between two candidate arms, which represent the choices being compared. In this work, we focus specifically on the setting where the reward functions are determined by Borda scores. We begin by establishing a regret lower bound for PbMDPs with Borda scores. As a preliminary step, we present a simple instance to prove a lower bound of $\Omega(\sqrt{HSAT})$ for episodic MDPs with adversarial losses, where $H$ is the number of steps per episode, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. Leveraging this…
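To make the notion of a Borda score concrete, the following is a minimal sketch using the definition common in the dueling-bandit literature: the Borda score of an arm is its average probability of being preferred over each other arm. The function name `borda_scores`, the preference matrix `P`, and the normalization by `K - 1` are illustrative assumptions; the paper's exact definition and the way scores enter the MDP reward may differ.

```python
import numpy as np

def borda_scores(P):
    """Average win probability of each arm against every other arm.

    P is a hypothetical K x K preference matrix where P[i, j] is the
    probability that arm i is preferred over arm j (so P[i, j] + P[j, i] = 1).
    """
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    # Sum each row, then drop the diagonal (an arm compared with itself).
    off_diagonal_sum = P.sum(axis=1) - np.diag(P)
    return off_diagonal_sum / (K - 1)

# Illustrative 3-arm example (values are made up for this sketch).
P = np.array([
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])
scores = borda_scores(P)
best_arm = int(np.argmax(scores))
```

In this toy example arm 0 has the highest Borda score, since it beats both other arms with probability above one half. In the adversarial setting described in the abstract, the preference matrix could change from episode to episode, so the scores would be recomputed per episode.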

Taxonomy

Topics: Network Security and Intrusion Detection · Formal Methods in Verification · Advanced Malware Detection Techniques