Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare
Maheed H. Ahmed, Mahsa Ghasemi

TL;DR
This paper introduces a fair dueling bandits framework that uses Nash Social Welfare to learn equitably across users with heterogeneous preferences. It provides a regret lower bound and algorithms that match it up to logarithmic factors.
Contribution
It formulates a fairness-aware dueling bandits model with user-specific Condorcet winners and establishes regret bounds, pioneering the quantification of fairness costs in this setting.
Findings
Established a regret lower bound of Ω(T^{2/3} min(K,D)^{1/3}) for fair dueling bandits.
Proposed algorithms with regret bounds matching the lower bound up to logarithmic factors.
Quantified the cost of fairness in heterogeneous preference scenarios.
Abstract
Learning from human preference data is becoming a useful tool, from fine-tuning large language models to training reinforcement learning agents. However, in most scenarios, the model is trained on the average preference of all human evaluators, which, under large variations of preferences, can be unfair to minority groups. In this work, we consider fairness in dueling bandits, a standard framework for online learning from preference data. We assume that each user has a (potentially distinct) Condorcet winner, which is an arm preferred to every other arm. Using these user-specific Condorcet winners as reference points, we evaluate and score arms according to their performance relative to the corresponding winner. To promote fairness across heterogeneous users, we adopt the well-established Nash Social Welfare objective, which maximizes the product of user utilities, thereby inherently…
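To make the two central notions in the abstract concrete, here is a minimal sketch, not the paper's algorithm. It assumes each user's preferences are given as a matrix of pairwise win probabilities and that each user's utility for an arm is already available as a number in (0, 1]. The paper's exact scoring rule relative to the Condorcet winner is not reproduced here, so the `UTILITIES` table and all function names are hypothetical illustrations.

```python
from math import prod

def condorcet_winner(pref):
    """Return the arm that beats every other arm with probability > 1/2, or None.

    pref[i][j] is the (assumed known) probability that arm i is preferred to arm j.
    """
    k = len(pref)
    for i in range(k):
        if all(pref[i][j] > 0.5 for j in range(k) if j != i):
            return i
    return None

def expected_utilities(utilities, policy):
    """Expected utility of each user when arms are drawn from `policy`."""
    return [sum(u * p for u, p in zip(row, policy)) for row in utilities]

def nash_social_welfare(utilities, policy):
    """Product of the users' expected utilities (the Nash Social Welfare)."""
    return prod(expected_utilities(utilities, policy))

def average_welfare(utilities, policy):
    """Mean of the users' expected utilities (the non-fair baseline objective)."""
    values = expected_utilities(utilities, policy)
    return sum(values) / len(values)

# Hypothetical utilities: rows are users, columns are arms.
# Two majority users prefer arm 0; one minority user prefers arm 1.
UTILITIES = [
    [1.0, 0.2],
    [1.0, 0.2],
    [0.2, 1.0],
]

always_arm_0 = [1.0, 0.0]   # best policy for the average objective
mixed_policy = [0.75, 0.25]  # gives the minority user's arm some weight

for name, policy in [("always arm 0", always_arm_0), ("mixed", mixed_policy)]:
    print(name,
          "average:", round(average_welfare(UTILITIES, policy), 3),
          "NSW:", round(nash_social_welfare(UTILITIES, policy), 3))
```

In this toy instance the average objective favors always playing arm 0, while the product objective favors the mixed policy, because a low utility for any single user drags the whole product down. This is the sense in which maximizing the product of utilities protects minority preferences.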
