
TL;DR
This paper studies bandit learning in open multi-agent systems, where agents arrive and depart over time. It addresses the resulting non-stationarity and the dependence of information accumulation on agent patterns, and introduces new concepts and algorithms with provable regret guarantees.
Contribution
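It formulates a unified open-system bandit framework covering heterogeneous rewards and general agent patterns, introduces concepts such as pre-training degree and stability, and develops certified global-UCB algorithms with tight regret bounds.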
Findings
Regret scales linearly with entry uncertainty, as captured by the pre-training degree of new agents.
In stable regimes, regret depends on identifying persistent optimal arms.
Lower bounds confirm the tightness of the proposed regret dependencies.
Abstract
Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the \emph{pre-training degree} of new agents quantifies how much information an agent carries upon entry, \emph{stability} measures…
