Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits
Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, and Quanquan Gu

TL;DR
This paper derives a variance-aware regret bound for stochastic contextual dueling bandits, so that the guarantee tightens when pairwise comparisons carry less uncertainty. The result is backed by theoretical analysis and by experiments on synthetic data.
Contribution
The paper proposes a new, computationally efficient SupLinUCB-type algorithm for contextual dueling bandits with a variance-aware regret bound, filling a gap left by earlier regret bounds that do not account for the uncertainty in pairwise comparisons.
Findings
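The regret bound scales with the square root of the cumulative variance of the pairwise comparisons, $\sqrt{\sum_{t=1}^T \sigma_t^2}$, rather than with the number of rounds alone.
The algorithm outperforms previous variance-agnostic methods on synthetic data.
When comparisons are deterministic (all $\sigma_t = 0$), the variance term vanishes and the bound reduces to its additive $d$ term.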
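As a rough worked illustration of the findings above, the two extreme cases can be written out using the bound in the form quoted in the peer reviews on this page. This is a sketch, not the paper's own statement: constants and logarithmic factors are ignored, and the second line assumes each comparison is a binary (Bernoulli) outcome, so that its variance is at most $1/4$.

$$
\begin{aligned}
\sigma_t = 0 \ \text{for all } t
  &\;\Longrightarrow\; \sqrt{d \sum_{t=1}^{T} \sigma_t^2} + d = d, \\
\sigma_t^2 \le \tfrac{1}{4} \ \text{for all } t
  &\;\Longrightarrow\; \sqrt{d \sum_{t=1}^{T} \sigma_t^2} + d \le \tfrac{1}{2}\sqrt{dT} + d.
\end{aligned}
$$

In the deterministic case the bound does not grow with $T$, while in the noisiest case it recovers the familiar $\sqrt{T}$ dependence.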
Abstract
Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde…
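The feedback model described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's algorithm: the logistic link is an assumption (the abstract only says the comparison comes from a GLM), and the names `sigmoid`, `duel`, and `theta_star` are hypothetical. The sketch shows how the variance of a binary comparison depends on how well separated the two arms are.

```python
import numpy as np

def sigmoid(z):
    """Logistic link function (assumed; the abstract only specifies a GLM)."""
    return 1.0 / (1.0 + np.exp(-z))

def duel(x_a, x_b, theta, rng):
    """Sample one comparison between arms a and b.

    Returns (outcome, win_prob, variance), where outcome is 1 if arm a wins
    and variance is the Bernoulli variance p * (1 - p) of that outcome.
    """
    p = float(sigmoid((x_a - x_b) @ theta))
    outcome = int(rng.random() < p)
    return outcome, p, p * (1.0 - p)

rng = np.random.default_rng(0)
theta_star = np.array([1.0, 0.0, 0.0])  # hypothetical unknown parameter

# Two nearly identical arms: the winner is close to a coin flip.
_, noisy_p, noisy_var = duel(
    np.array([0.1, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]), theta_star, rng
)

# Two well-separated arms: the outcome is nearly deterministic.
_, clear_p, clear_var = duel(
    np.array([20.0, 0.0, 0.0]), np.array([-20.0, 0.0, 0.0]), theta_star, rng
)

print(f"noisy duel: p={noisy_p:.3f}, variance={noisy_var:.3f}")
print(f"clear duel: p={clear_p:.3f}, variance={clear_var:.3g}")
```

The first duel has variance close to the Bernoulli maximum of 1/4, while the second has variance close to zero. A variance-aware bound is meant to reward instances dominated by comparisons of the second kind.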
Peer Reviews
Decision: ICLR 2024 poster
The problem setup is well laid out and easy to follow with the precisely required assumptions.
The paper is not self-contained and requires the reader to consult multiple other papers, for example in Section 4.2.
1. This paper is in general well organized and easy to follow.
2. The variance is considered in the dueling bandit setting.
3. I appreciate the detailed review on existing literature, and the clarification on the novelty of the proposed algorithm and the differences from existing algorithms (especially SupLinUCB-type ones).
1. The variance $\sum_{t=1}^T \sigma_t$ in the regret bound is a random variable. I think it would be much better to involve a term that indicates the variance of the instance in some sense but is not random in an expected bound.
2. The appearance of $\sum_{t=1}^T \sigma_t$ indicates that even if we know $X_t$ for all arms and $\theta^*$, we may not know the value of the derived upper bound.
3. I wonder if it is possible to derive a lower bound for the problem. If not, may the author(s) clar
- The primary contribution of the paper is the introduction of a new algorithm VACDB (Variance-Aware Contextual Dueling Bandits), which incorporates a SupLinUCB-type approach to handle the contextual information and provide a variance-aware regret bound.
- The regret bound $O\left(\sqrt{d\sum_{t=1}^{T} \sigma_t^2} + d\right)$ proposed by the authors provides a more nuanced performance measure that reflects the difficulty of the decision-making problem.
- Beyond the specific algorithm and regret
- The paper primarily conducts experiments on synthetic data to validate the proposed algorithm. While this is a common practice, the performance of the algorithm in real-world scenarios might differ. To strengthen the paper, the authors could include experiments on real-world datasets, particularly those related to the applications mentioned like ranking, recommendation systems, or any human-interactive system, ensuring the practicality and robustness of the algorithm in diverse settings.
- The
Taxonomy
Topics: Advanced Bandit Algorithms Research · Distributed Sensor Networks and Detection Algorithms · Data Stream Mining Techniques
