Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning
Xiangkun Wu, Qianglin Wen, Yingying Zhang, Hongtu Zhu, Ting Li, and Chengchun Shi

TL;DR
This paper introduces a transformer reinforcement learning method for time series A/B testing, enabling better treatment allocation by fully leveraging historical data and directly optimizing the treatment effect estimation accuracy.
Contribution
It presents a novel transformer RL approach that overcomes limitations of existing designs by conditioning on full history and directly optimizing the mean squared error.
Findings
Outperforms existing designs on synthetic data
Effective on a real-world ridesharing dataset
Demonstrates improved treatment effect estimation
Abstract
A/B testing has become a gold standard for modern technological companies to conduct policy evaluation. Yet, its application to time series experiments, where policies are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition allocation on the entire history and employs…
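To make the abstract's objective concrete, the following toy Monte Carlo sketch shows how the mean squared error of an estimated treatment effect can depend on the allocation design. It is not the paper's transformer RL method. The outcome model (a constant effect `beta` with AR(1) noise and no carryover), the difference-in-means estimator, and the two designs compared are all illustrative assumptions, and every function name here is hypothetical.

```python
import random

def simulate_outcomes(actions, beta, rho, rng):
    """Assumed toy model: Y_t = beta * A_t + e_t with AR(1) noise e_t."""
    outcomes, e = [], 0.0
    for a in actions:
        e = rho * e + rng.gauss(0.0, 1.0)  # autocorrelated noise
        outcomes.append(beta * a + e)
    return outcomes

def diff_in_means(actions, outcomes):
    """Difference-in-means estimate of the treatment effect."""
    treated = [y for a, y in zip(actions, outcomes) if a == 1]
    control = [y for a, y in zip(actions, outcomes) if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def alternating_design(T, rng):
    """Switch treatment at every time step, starting from a random arm."""
    start = rng.randint(0, 1)
    return [(start + t) % 2 for t in range(T)]

def iid_design(T, rng):
    """Fair coin flip at every step; redraw if one arm is never used."""
    while True:
        actions = [rng.randint(0, 1) for _ in range(T)]
        if 0 < sum(actions) < T:
            return actions

def monte_carlo_mse(design, T=40, beta=1.0, rho=0.8, reps=2000, seed=0):
    """Average squared error of the estimate over repeated experiments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        actions = design(T, rng)
        outcomes = simulate_outcomes(actions, beta, rho, rng)
        total += (diff_in_means(actions, outcomes) - beta) ** 2
    return total / reps

mse_alternating = monte_carlo_mse(alternating_design)
mse_iid = monte_carlo_mse(iid_design)
print(f"alternating design MSE: {mse_alternating:.4f}")
print(f"i.i.d. design MSE:      {mse_iid:.4f}")
```

Under these assumptions (positively autocorrelated noise, no carryover), the alternating design should give the lower MSE, because differencing adjacent time steps cancels much of the slowly varying noise. Neither design here adapts to the observed history, which is the aspect the paper's approach targets.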
Peer Reviews
Decision: ICLR 2026 Poster
Importance. I like that the paper targets time-series A/B testing where carryover effects are the norm; it seems very relevant for real platforms that roll out policies sequentially. Empirical breadth. The breadth of experiments (synthetic, real-data-based, public simulator) and the many replications with CIs looks good; it seems the comparisons are careful. The overall presentation of the paper is clear.
Scope and assumptions of Theorem 1. The impossibility result hinges on constructing settings where optimal assignment depends on the full history; could you (a) delimit the regularity conditions under which this dependence is strictly necessary and (b) discuss practically checkable conditions indicating when shorter-memory designs are near-optimal? (Right now, the proof shows existence rather than prevalence.) You define ATE and optimize MSE, but multiple ATE estimators appear (e.g., OLS in li
- The paper provides a comprehensive and well-structured literature review, clearly situating the work within the A/B testing, experimental design, and reinforcement learning communities.
- It proposes a novel use of reinforcement learning for experimental design, employing transformer architectures to condition treatment allocation on the entire observed history, encoded as an augmented state. This design choice is both technically interesting and conceptually well-motivated.
- The experimental
My main concern lies in the formulation and proof of the main theoretical result (Theorem 1).
- The statement of Theorem 1 does not align with what is actually proved. The optimization problem (2) is formulated for an arbitrary ATE estimator, implying that the theorem should hold universally. However, the proof effectively fixes a specific estimator, and the argument depends crucially on that choice. As a result, the theorem as stated appears too general, and it is not demonstrated that the cla
1) The central contribution and gap in the literature are well presented.
2) The evaluation covers various baselines and both synthetic and real data.
3) The discussion of related literature is extensive.
1) While it is positive that the authors discuss a lot of related literature, the discussion itself is not as easy to follow. There are many references, but often little discussion to clarify the types of approaches presented. Also, the specific limitations are not always clear. For instance, when discussing A/B testing, it is said that some works relax the Markov assumption by modeling the data as a partially observable Markov decision process. Why is this still not enough to solve the problem?
Taxonomy
Topics: Advanced Causal Inference Techniques · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research
