Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback
Ming Shi

TL;DR
This paper introduces a probe-then-commit algorithm for multi-objective bandits with limited multi-arm feedback and shows that probing $q$ arms per round improves the regret bound by a factor of $1/\sqrt{q}$.
Contribution
It develops the PtC-P-UCB algorithm with frontier-aware probing and provides theoretical regret bounds showing benefits of limited multi-arm probing in multi-objective bandit problems.
Findings
Achieves a $1/\sqrt{q}$ acceleration in regret bounds with limited probing.
Extends to multi-modal probing with variance-adaptive bounds.
Provides theoretical guarantees for Pareto frontier exploration.
Abstract
We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among candidate links/servers (arms) whose performance is a stochastic multi-dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is probe-then-commit (PtC): the agent may probe up to $q$ candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime strictly interpolates between classical bandits ($q = 1$) and full-information experts ($q$ equal to the number of arms), yet existing multi-objective learning theory largely focuses on these extremes. We develop PtC-P-UCB, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty in a Pareto mode,…
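To make the probe-then-commit interaction concrete, the following is a minimal Python sketch of the protocol described in the abstract. It is not the paper's PtC-P-UCB algorithm. The function names (`pareto_front`, `run_ptc`), the Gaussian noise model, the UCB bonus, and the use of a fixed linear scalarization (`weights`) in place of frontier-aware probing are all illustrative assumptions.

```python
import numpy as np


def pareto_front(points):
    """Indices of rows not dominated by any other row (larger is better)."""
    points = np.asarray(points, dtype=float)
    keep = []
    for i in range(len(points)):
        dominated = any(
            j != i
            and np.all(points[j] >= points[i])
            and np.any(points[j] > points[i])
            for j in range(len(points))
        )
        if not dominated:
            keep.append(i)
    return keep


def run_ptc(means, q, horizon, weights, noise=0.1, seed=0):
    """Simulate probe-then-commit: probe q arms per round, commit to one.

    Assumption: arms are ranked by a scalarized UCB index, which is a
    stand-in for the paper's frontier-aware probing rule.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n_arms, dim = means.shape
    counts = np.zeros(n_arms)
    sums = np.zeros((n_arms, dim))
    committed = []
    for t in range(1, horizon + 1):
        # Optimistic per-objective estimates; unprobed arms get a huge bonus.
        est = sums / np.maximum(counts, 1)[:, None]
        bonus = np.where(
            counts > 0,
            np.sqrt(2.0 * np.log(t + 1) / np.maximum(counts, 1)),
            1e6,
        )
        scores = (est + bonus[:, None]) @ weights
        probe = np.argsort(-scores)[:q]  # q distinct arms to probe
        # Control-plane probes reveal one vector outcome per probed arm.
        outcomes = means[probe] + noise * rng.standard_normal((len(probe), dim))
        counts[probe] += 1
        sums[probe] += outcomes
        # Data plane: execute exactly one of the probed arms.
        committed.append(int(probe[np.argmax(outcomes @ weights)]))
    return committed, counts


if __name__ == "__main__":
    # Hypothetical 4-arm, 2-objective instance; arm 0 dominates the rest.
    arm_means = [[0.9, 0.9], [0.2, 0.3], [0.3, 0.2], [0.1, 0.1]]
    chosen, probes = run_ptc(arm_means, q=2, horizon=200, weights=[0.5, 0.5])
    print("probe counts per arm:", probes)
    print("Pareto-optimal arms:", pareto_front(arm_means))
```

Setting `q=1` reduces the loop to a standard bandit, where only the executed arm is observed. Setting `q` to the number of arms gives full-information feedback, matching the two extremes named in the abstract.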
Taxonomy
Topics: Advanced Bandit Algorithms Research · Age of Information Optimization · IoT and Edge/Fog Computing
