Hybrid Combinatorial Multi-armed Bandits with Probabilistically Triggered Arms

Kongchang Zhou; Tingyu Zhang; Wei Chen; Fang Kong

arXiv:2512.21925·cs.LG·December 29, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a hybrid framework for combinatorial multi-armed bandits with probabilistically triggered arms, combining offline data and online learning to improve exploration, convergence, and bias correction.

Contribution

It proposes the hybrid CUCB algorithm that integrates offline data with online interaction, providing theoretical regret guarantees and empirical validation.

Findings

1. Hybrid CUCB outperforms purely online methods with high-quality offline data.
2. The algorithm effectively corrects offline data bias when data is limited.
3. Empirical results show consistent advantage over existing approaches.

Abstract

The problem of combinatorial multi-armed bandits with probabilistically triggered arms (CMAB-T) has been extensively studied. Prior work primarily focuses on either the online setting where an agent learns about the unknown environment through iterative interactions, or the offline setting where a policy is learned solely from logged data. However, each of these paradigms has inherent limitations: online algorithms suffer from high interaction costs and slow adaptation, while offline methods are constrained by dataset quality and lack of exploration capabilities. To address these complementary weaknesses, we propose hybrid CMAB-T, a new framework that integrates offline data with online interaction in a principled manner. Our proposed hybrid CUCB algorithm leverages offline data to guide exploration and accelerate convergence, while strategically incorporating online interactions to…
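The idea of letting offline data tighten exploration while online samples guard against bias can be illustrated with a small sketch. This is not the paper's algorithm. It is one plausible form of a per-arm hybrid index, assuming rewards in [0, 1], a known upper bound `bias_bound` on the gap between offline and online means (a stand-in for the bound on $V$ mentioned in the reviews), and the confidence radius $\sqrt{1.5 \ln t / n}$ commonly used for CUCB. The function name `hybrid_ucb` and all parameter names are hypothetical.

```python
import math


def hybrid_ucb(online_sum, n_online, offline_sum, n_offline, t, bias_bound):
    """Hypothetical hybrid optimistic index for one base arm (rewards in [0, 1]).

    Takes the smaller of two upper confidence bounds:
      * an online-only UCB, which is unbiased but may be loose, and
      * a pooled UCB over offline + online samples, inflated by the
        worst-case bias the offline samples could have introduced.
    """
    # Online-only bound; an arm with no online samples gets the trivial bound 1.
    if n_online == 0:
        online_ucb = 1.0
    else:
        online_ucb = online_sum / n_online + math.sqrt(1.5 * math.log(t) / n_online)

    n_total = n_online + n_offline
    if n_total == 0:
        return 1.0

    # Pooled bound: more samples shrink the radius, but the offline fraction
    # of the pooled mean may be shifted by up to bias_bound (an assumption).
    pooled_mean = (online_sum + offline_sum) / n_total
    pooled_radius = math.sqrt(1.5 * math.log(t) / n_total)
    bias_term = (n_offline / n_total) * bias_bound
    pooled_ucb = pooled_mean + pooled_radius + bias_term

    return min(1.0, online_ucb, pooled_ucb)


online_only = hybrid_ucb(50, 100, 0, 0, t=100, bias_bound=0.0)
aligned_offline = hybrid_ucb(50, 100, 450, 900, t=100, bias_bound=0.0)
biased_offline = hybrid_ucb(50, 100, 450, 900, t=100, bias_bound=1.0)
```

With aligned offline data (`bias_bound=0.0`) the pooled bound is tighter than the online-only one. With a large `bias_bound` the index falls back to the online-only bound, which mirrors the reviewers' remark that the regret bounds reduce to online CUCB when offline data are unhelpful. In a CUCB-style loop these per-arm indices would then be passed to the combinatorial oracle to choose the next action.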

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The problem appears to be novel. CMAB-T has been studied in the purely online setting (multiple works) and recently in the purely offline setting. There is also work for several bandit settings using hybrid data, though before Cheung and Lyu (2024) most/all considered identically distributed environments.
- The authors analyze gap-dependent and gap-independent bounds, recovering classic regret bounds for the CUCB algorithm for CMAB-T as a special case.
- The authors run experiments in both biased and unbiased settings.

Weaknesses

### Major

- My primary concerns are about novelty. The algorithm H-UCB appears to be CUCB with the paired UCB estimators from Cheung and Lyu (ICML 2024). It is unclear to me how much technical novelty there is in adapting CUCB regret bounds in light of Cheung and Lyu’s prior work on adapting MAB.
- In Section 4, the authors state “we design a hybrid confidence bound...” that is identical to Cheung and Lyu’s (their Alg 1 step 6) without acknowledgement (at least in that section). (The author

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The setting for hybrid CMAB-T is natural and clean.
2. By following prior papers and combining existing algorithms, Hybrid CUCB works naturally and well, validated both theoretically and empirically.
3. The interpretations and intuitions behind the sample savings in the terms $N_i^{\prime}$ and $N_i^{\prime \prime}$ are clear and easy to understand.
4. Both gap-dependent and gap-independent bounds are provided, making the upper bound results comprehensive.

Weaknesses

1. Lack of lower bound: This paper provides promising upper bounds, in both gap-dependent and gap-independent versions. However, no lower bound is provided. In [1], lower bounds for both the instance-dependent and instance-independent versions are given, and optimality is guaranteed. I consider this contribution crucial in their work, since it shows how effective the offline samples can be, at most, in the biased hybrid setting. By only providing t

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The hybrid learning problem in CMAB-T is relevant and underexplored. This paper provides a systematic formalization, bridging existing gaps between the purely online and purely offline settings.
- Both gap-dependent (Theorem 1) and gap-independent (Theorem 2) regret bounds are derived.
- Results in Figures 1 and 2 show consistent improvement of hybrid CUCB over CUCB (online) and CLCB (offline) across different offline data sizes and bias settings.

Weaknesses

1. Mostly small-scale synthetic + one real dataset; no stress tests on larger action spaces/trigger structures or runtime.
2. The paper hypes the hybrid approach’s advantage but gives limited attention to scenarios where hybridization fails (e.g., more adverse bias structures, very small offline samples, or worst-case triggering).

Reviewer 04 · Rating 4 · Confidence 3

Strengths

1. To the best of my knowledge, this paper provides the first hybrid (offline+online) treatment for general CMAB-T with probabilistic triggering, giving regret bounds that reduce to online CUCB when offline data are unhelpful and improve with aligned offline data.
2. The decomposition of the regret clearly shows the benefit of using offline data via its effective number of samples, resulting in $O(-\sqrt{N_i^{\prime}})$ savings in Theorem 1. Discussions of Theorems 1 and 2 explain how the upper bounds

Weaknesses

1. While the authors make connections between their general results and the edge cases with fully informative offline data and non-informative offline data, they provide no new lower bounds on the regret. This negatively impacts the significance of the work since it is mainly theoretical in nature, and related works of this kind usually have accompanying lower bounds.
2. A limitation is that the algorithm requires the knowledge of an upper bound on $V$, the discrepancy of the means of online a

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems