Gradient-free Online Learning in Games with Delayed Rewards
Amélie Héliou, Panayotis Mertikopoulos, Zhengyuan Zhou

TL;DR
This paper introduces a gradient-free learning approach for multi-player games with delayed, asynchronous rewards, proving convergence to Nash equilibrium despite unbounded delays in feedback.
Contribution
The paper develops a gradient-free learning policy for multi-player games with continuous action spaces and delayed, asynchronous feedback, and proves convergence to Nash equilibrium under a standard diagonal concavity assumption.
Findings
New regret bounds for delayed reward settings
Convergence to Nash equilibrium with probability 1
Applicable to unbounded delay scenarios
Abstract
Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that…
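The queueing idea in the abstract can be illustrated with a small single-agent sketch. This is not the authors' algorithm: it assumes a one-dimensional action set [0, 1], a hypothetical concave payoff, a standard one-point (payoff-based) gradient estimate, bounded random delays for simplicity, and a priority queue ordered by the round in which each action was played. The names `RewardQueue`, `one_point_estimate`, and `run`, along with the step-size and perturbation schedules, are illustrative choices only.

```python
import heapq
import random


class RewardQueue:
    """Min-heap of delayed payoffs; the oldest played round is served first (assumed ordering)."""

    def __init__(self):
        self._heap = []

    def push(self, round_played, direction, delta, payoff):
        heapq.heappush(self._heap, (round_played, direction, delta, payoff))

    def pop_oldest(self):
        return heapq.heappop(self._heap) if self._heap else None

    def __len__(self):
        return len(self._heap)


def one_point_estimate(payoff, direction, delta, dim=1):
    """Gradient estimate built from a single payoff observation at a perturbed action."""
    return (dim / delta) * payoff * direction


def run(num_rounds=2000, max_delay=5, seed=0):
    rng = random.Random(seed)
    payoff_fn = lambda a: -(a - 0.3) ** 2  # hypothetical concave payoff
    x = 0.9                                # base (unperturbed) action
    in_flight = []                         # payoffs that have not arrived yet
    queue = RewardQueue()

    for t in range(1, num_rounds + 1):
        delta = 0.25 / t ** 0.25           # illustrative perturbation radius
        step = 0.1 / t ** 0.75             # illustrative step size

        # Keep the perturbed action inside [0, 1].
        base = min(max(x, delta), 1.0 - delta)
        direction = rng.choice((-1, 1))
        action = base + delta * direction

        # The payoff for this round arrives after a random (here bounded) delay.
        arrival = t + rng.randint(0, max_delay)
        in_flight.append((arrival, t, direction, delta, payoff_fn(action)))

        # Move everything that has arrived by now into the priority queue.
        arrived = [item for item in in_flight if item[0] <= t]
        in_flight = [item for item in in_flight if item[0] > t]
        for _, played, d, dl, u in arrived:
            queue.push(played, d, dl, u)

        # Use at most one queued payoff per round; skip the update if none is available.
        item = queue.pop_oldest()
        if item is not None:
            _, d, dl, u = item
            x = min(max(x + step * one_point_estimate(u, d, dl), 0.0), 1.0)

    return x
```

Calling `run()` returns the agent's final base action. In this sketch, payoffs that arrive out of order are still consumed in the order their actions were played, and rounds with no available feedback simply leave the action unchanged.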
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems
