Minimax-optimal trust-aware multi-armed bandits
Changxiao Cai, Jiacheng Zhang

TL;DR
This paper introduces a trust-aware multi-armed bandit framework that models human trust dynamics, establishing minimax regret bounds and proposing a novel algorithm that outperforms traditional methods like UCB in trust-influenced scenarios.
Contribution
The paper develops a new trust-aware MAB model, derives minimax regret bounds that account for trust dynamics, and proposes a near-optimal algorithm to address trust-related suboptimality.
Findings
Vanilla UCB is suboptimal under trust dynamics.
The proposed two-stage algorithm achieves near-optimal regret.
Simulation results demonstrate improved performance with trust considerations.
Abstract
Multi-armed bandit (MAB) algorithms have achieved significant success in sequential decision-making applications, under the premise that humans perfectly implement the recommended policy. However, existing methods often overlook the crucial factor of human trust in learning algorithms. When trust is lacking, humans may deviate from the recommended policy, leading to undesired learning performance. Motivated by this gap, we study the trust-aware MAB problem by integrating a dynamic trust model into the standard MAB framework. Specifically, the model assumes that the recommended and actually implemented policies differ depending on human trust, which in turn evolves with the quality of the recommended policy. We establish the minimax regret in the presence of the trust issue and demonstrate the suboptimality of vanilla MAB algorithms such as the upper confidence bound (UCB) algorithm. To overcome…
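The interaction loop described in the abstract can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the paper's model or algorithm: the trust update (a smoothed fraction of past recommendations that fell in the trust set), the Bernoulli rewards, the arm means, and the function name `simulate_ucb_with_trust` are all hypothetical. The uniform fallback policy and the 10-arm, 2-arm-trust-set setup follow the simulation setting summarized in the peer reviews.

```python
import math
import random


def simulate_ucb_with_trust(means, trust_set, horizon, seed=0):
    """Run standard UCB1 when a human only sometimes follows the recommendation.

    Assumptions (hypothetical, for illustration only):
      - trust = smoothed fraction of past recommendations inside trust_set
      - with probability `trust` the human plays the recommended arm,
        otherwise a uniformly random arm
      - rewards are Bernoulli with the given means
    Returns (cumulative pseudo-regret, fraction of rounds the human followed).
    """
    rng = random.Random(seed)
    num_arms = len(means)
    counts = [0] * num_arms      # pulls of each arm (by the human)
    sums = [0.0] * num_arms      # total reward observed per arm
    recs_in_trust_set = 0        # past recommendations inside the trust set
    follows = 0
    regret = 0.0
    best_mean = max(means)

    for h in range(1, horizon + 1):
        # UCB1 recommendation: try unpulled arms first, then maximize the index.
        untried = [a for a in range(num_arms) if counts[a] == 0]
        if untried:
            recommended = untried[0]
        else:
            recommended = max(
                range(num_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(h) / counts[a]),
            )

        # Assumed trust level, based on recommendations before this round.
        trust = (1 + recs_in_trust_set) / (2 + (h - 1))
        followed = rng.random() < trust
        actual = recommended if followed else rng.randrange(num_arms)

        reward = 1.0 if rng.random() < means[actual] else 0.0
        counts[actual] += 1
        sums[actual] += reward
        regret += best_mean - means[actual]
        follows += int(followed)
        recs_in_trust_set += int(recommended in trust_set)

    return regret, follows / horizon


# Hypothetical arm means; the trust set holds the two highest-reward arms
# (arms 9 and 10 in 1-indexed notation).
arm_means = [0.05 + 0.09 * i for i in range(10)]
regret, follow_rate = simulate_ucb_with_trust(arm_means, {8, 9}, horizon=2000)
print(f"pseudo-regret: {regret:.1f}, follow rate: {follow_rate:.2f}")
```

In this sketch, every exploratory recommendation outside the trust set lowers the assumed trust level, so later recommendations are followed less often. Whether this matches the paper's dynamics depends on its actual trust update, which is not reproduced here.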
Peer Reviews
Decision: Submitted to ICLR 2025
1. The authors consider an important problem, of imperfect execution by downstream actors during the implementation of a sequential decision-making algorithm. The model is reasonably formulated, with a changing "trust level" by the human depending on the policy recommendations so far. The notion of private information (i.e., the trust set) that is not visible to the policymaker is also reasonable and well-formulated.
2. The theoretical results are rigorous, first showing how the standard UCB a
1. Empirical results are limited and fail to show robustness. There is only a single setting, where the number of arms is rather small ($K=10$), the trust set is extremely small (just 2 arms, $\mathcal{T} = \{9,10\}$), and the human has a very naive policy (uniform exploration, with no learning). The arms are also selected so that the trust set is the two highest-reward arms. How does performance compare when the trust set is 50\% of arms? What about a larger number of arms? What about when the
1) The paper addresses a critical gap by incorporating human trust into MAB frameworks, which is relevant for real-world applications like human-robot interactions.
2) The authors establish the minimax lower bound for their trust-aware model and demonstrate that standard UCB algorithms can incur near-linear regret under trust-related deviations.
3) The authors develop a novel trust-aware UCB algorithm, supported by rigorous theoretical results, including a near-minimax optimal regret bound.
1) The trust model used in this paper (the disuse trust behavior model) is relatively simple, and its applicability to more complex trust frameworks may be limited.
2) While the regret bound of Algorithm 1 in Theorem 2 is nearly optimal with respect to the time horizon $H$, the dependence on $K$ could be improved. As the authors note in Remark 4, optimizing the trust set identification stage could help reduce the burn-in cost. Additionally, Assumption 1 might further improve the regret in the t
- The proposed algorithm provides a way to estimate the dynamical model with stochastic feedback in a small number of rounds. This appears to be a relatively difficult problem that is certainly more difficult than estimating the mean of a fixed distribution (as is typically done in explore-exploit problems for stochastic bandits).
- The paper shows (nearly)-minimax optimal regret.
- The presentation is good.
- I didn't see where the paper states the information available to the algorithm. It looks like the algorithm (in Algorithm 1) uses $\chi_h$. I don't think this is very realistic because the algorithm is observing the implementer's internal randomness. It would be more realistic if the algorithm observed the implementer's action $a_h^{ac}$. Note that this is not the same as observing $\chi_h$ because the implementer's policy could be the same as the algorithm's in some rounds and therefore $a_h^
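The distinction the reviewer draws, between observing the implementer's internal randomness and observing only the implemented action, can be made concrete with a short Bayes calculation. This is an illustrative sketch under assumptions that are not taken from the paper: the implementer follows the recommendation with probability t_h (a hypothetical symbol for the trust level), the indicator chi_h equals 1 exactly when the recommendation is followed, and otherwise the implementer picks one of the K arms uniformly at random. The symbol a_h^{rec} for the recommended arm is likewise hypothetical.

```latex
\Pr\left(\chi_h = 1 \,\middle|\, a_h^{ac} = a_h^{rec}\right)
  = \frac{t_h}{t_h + (1 - t_h)/K}
% Example: K = 10, t_h = 0.5 gives 0.5 / 0.55 = 10/11 \approx 0.909
```

Under these assumptions, a match between the implemented and recommended arm makes it likely, but not certain, that the recommendation was followed, so the implemented action is only a noisy signal of the indicator.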
Taxonomy
Topics: Advanced Bandit Algorithms Research · Distributed Sensor Networks and Detection Algorithms · Age of Information Optimization
