Offline Clustering of Linear Bandits: The Power of Clusters under Limited Data
Jingyuan Liu, Zeyu Zhang, Xuchuang Wang, Xutong Liu, John C.S. Lui, Mohammad Hajiesmaili, Carlee Joe-Wong

TL;DR
This paper introduces offline clustering algorithms for linear bandits that leverage limited offline data to improve decision-making, addressing data scarcity challenges absent in online methods.
Contribution
It proposes two novel algorithms, Off-C2LUB and Off-CLUB, tailored for offline bandit clustering with theoretical analysis and empirical validation.
Findings
Off-C2LUB outperforms existing methods with limited offline data.
Off-CLUB performs well with sufficient data, nearing the theoretical lower bound.
Algorithms are validated on real and synthetic datasets.
Abstract
Contextual multi-armed bandit is a fundamental learning framework for making a sequence of decisions, e.g., advertising recommendations for a sequence of arriving users. Recent works have shown that clustering these users based on the similarity of their learned preferences can accelerate the learning. However, prior work has primarily focused on the online setting, which requires continually collecting user data, ignoring the offline data widely available in many applications. To tackle these limitations, we study the offline clustering of bandits (Off-ClusBand) problem, which studies how to use the offline dataset to learn cluster properties and improve decision-making. The key challenge in Off-ClusBand arises from data insufficiency for users: unlike the online case where we continually learn from online data, in the offline case, we have a fixed, limited dataset to work from and…
Peer Reviews
Decision·Submitted to ICLR 2026
1) The paper considers the offline version of the clustered linear bandit setup. 2) The paper is wary of the bias that arises when fixed, limited data forces heterogeneous users into the same cluster, which distorts the parameter estimates used for the decision. That is interesting and commendable. 3) The error decomposes into an $O(1/\sqrt{\lambda_a N})$ term, where $N$ is the set of homogeneous users identified in the test user's cluster, and a bias term that depends on the inclusion of heterogeneous users.
1) The whole machinery revolves around standard concentration for a linear Gaussian model with sub-Gaussian noise, in the case where the smallest eigenvalue of the data matrix is bounded below. Everything else is a more detailed manipulation of the confidence estimates, combined with a clustering routine that aggregates users whose parameter estimates agree up to a confidence width. While the caution about clustering heterogeneous users is reflected in the bounds and the approach, I am quite unclear
1: The "cluster from logs → act once" framing is well-motivated for applications where online interaction is limited. 2: The empty-graph vs. complete-graph constructions align with low-data vs. high-data regimes and are easy to implement. 3: The analysis surfaces a bias–variance trade-off around the clustering threshold and identifies a data-sufficiency regime where the complete-graph pruning approach performs well. 4: Results across synthetic and real data broadly match the narrative.
1: Limited technical novelty. The methods largely assemble standard components—ridge regression with confidence sets, distance-threshold user graphs, and pessimistic action selection. The paper’s contribution is more in problem formulation and tidy integration than in new algorithmic primitives or estimation techniques. The one-hop aggregation choice (vs. full component) is interesting but not theoretically pinned down as a strict improvement beyond intuition and ablations. 2: Strong regularity
The authors consider a salient problem: pooling data from heterogeneous users to make decisions when adapting to a new user. The paper is largely rigorously written, and the proofs appear to be sound. The experiments are involved, especially for a theory paper.
1. The authors are very much unaware of a large body of existing literature in this field. - The "clusters of bandits" problem is essentially an offline version of the latent bandit problem. See Hong et al. (2020), who first learn clusters from offline data before taking actions online. Shi et al. (2023) discuss offline latent RL, which is a variant of the latent MDP setting of Kwon et al. (2021). - Rigorous guarantees for clustering tabular MDPs under the Markovian setting have been obtained
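The components the reviews name (ridge regression with confidence sets, a distance-threshold user graph, one-hop aggregation, and pessimistic action selection) can be illustrated with a minimal sketch. This is not the authors' Off-C2LUB or Off-CLUB; the edge rule, the confidence width `alpha / sqrt(lambda_min(V))`, the parameters `gamma`, `alpha`, `beta`, and all function names are assumptions made for illustration only.

```python
import numpy as np

def ridge_stats(X, y, lam=1.0):
    """Return the regularized Gram matrix V, the vector b = X^T y, and the ridge estimate."""
    V = lam * np.eye(X.shape[1]) + X.T @ X
    b = X.T @ y
    return V, b, np.linalg.solve(V, b)

def conf_width(V, alpha=1.0):
    """Assumed confidence width: alpha / sqrt(smallest eigenvalue of V)."""
    return alpha / np.sqrt(np.linalg.eigvalsh(V)[0])

def one_hop_neighbors(thetas, widths, target, gamma):
    """Users linked to `target` under a hypothetical conservative edge rule:
    connect only if the estimate gap plus both widths stays within gamma."""
    return [v for v in range(len(thetas))
            if v == target
            or np.linalg.norm(thetas[target] - thetas[v])
            + widths[target] + widths[v] <= gamma]

def pessimistic_action(arms, V, theta, beta=1.0):
    """Pick the arm with the largest lower confidence bound."""
    V_inv = np.linalg.inv(V)
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
    return int(np.argmax(arms @ theta - beta * bonus))

# Synthetic offline logs (assumed setup): six users, two clusters of three.
rng = np.random.default_rng(0)
lam, d = 1.0, 3
true_thetas = [np.array([1.0, 0.0, 0.0])] * 3 + [np.array([-1.0, 0.0, 0.0])] * 3
stats = []
for theta_star in true_thetas:
    X = rng.normal(size=(200, d))
    y = X @ theta_star + 0.1 * rng.normal(size=200)
    stats.append(ridge_stats(X, y, lam))

thetas = [s[2] for s in stats]
widths = [conf_width(s[0]) for s in stats]
neighbors = one_hop_neighbors(thetas, widths, target=0, gamma=1.0)

# Pool only the one-hop neighbors' data, counting the regularizer once.
V_pool = lam * np.eye(d) + sum(stats[v][0] - lam * np.eye(d) for v in neighbors)
b_pool = sum(stats[v][1] for v in neighbors)
theta_pool = np.linalg.solve(V_pool, b_pool)

arms = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
action = pessimistic_action(arms, V_pool, theta_pool)
```

In this toy setup the threshold `gamma` carries the bias-variance trade-off the reviews mention: a larger value pools more users and shrinks the confidence bonus, but risks admitting users from the other cluster and biasing `theta_pool`.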
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Bandit Algorithms Research · Recommender Systems and Techniques
