In-Context Learning for Pure Exploration
Alessio Russo, Ryan Welch, Aldo Pacchiano

TL;DR
This paper introduces ICPE, a Transformer-based model that actively and efficiently performs pure exploration tasks such as best-arm identification and generalized search without explicitly modeling the information structure, demonstrating competitive results across various benchmarks.
Contribution
The paper presents ICPE, a novel Transformer-based approach for pure exploration that transfers in-context and operates without explicit information structure modeling.
Findings
ICPE performs competitively with adaptive baselines.
ICPE requires no parameter updates during inference.
Transformers are effective for general sequential testing.
Abstract
We study the problem of active sequential hypothesis testing, also known as pure exploration: given a new task, the learner adaptively collects data from the environment to efficiently determine an underlying correct hypothesis. A classical instance of this problem is the task of identifying the best arm in a multi-armed bandit problem (a.k.a. BAI, Best-Arm Identification), where actions index hypotheses. Another important case is generalized search, a problem of determining the correct label through a sequence of strategically selected queries that indirectly reveal information about the label. In this work, we introduce In-Context Pure Explorer (ICPE), which meta-trains Transformers to map observation histories to query actions and a predicted hypothesis, yielding a model that transfers in-context. At inference time, ICPE actively gathers evidence on new tasks and infers the true…
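To make the best-arm identification setting concrete, here is a minimal sketch of a non-adaptive fixed-budget baseline: sample the arms round-robin, then recommend the arm with the highest empirical mean. This is not ICPE and does not come from the paper. The function names `pull` and `uniform_bai`, the Bernoulli reward model, and the example arm means are all assumptions chosen for illustration.

```python
import random


def pull(means, arm, rng):
    """Sample a Bernoulli reward from the chosen arm (assumed reward model)."""
    return 1.0 if rng.random() < means[arm] else 0.0


def uniform_bai(means, budget, seed=0):
    """Fixed-budget best-arm identification with round-robin sampling.

    `means` holds the true (hidden) arm means and is used only to simulate
    rewards. Returns the index of the arm with the highest empirical mean
    after `budget` pulls.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        arm = t % k  # non-adaptive: the query ignores the history so far
        sums[arm] += pull(means, arm, rng)
        counts[arm] += 1
    # Arms that were never pulled get an estimate of 0.0.
    estimates = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return max(range(k), key=lambda a: estimates[a])


if __name__ == "__main__":
    # Hypothetical three-armed instance; arm 1 has the highest mean.
    print(uniform_bai([0.1, 0.9, 0.2], budget=3000))
```

An adaptive method would replace the line `arm = t % k` with a rule that depends on the observed history, which is the part a pure exploration learner aims to improve on.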
Peer Reviews
Decision·ICLR 2026 Poster
(1) The theoretical formulation is principled and elegant. The authors derive an information-theoretic reward function directly from the posterior optimality conditions, avoiding ad hoc design choices. (2) By showing that Transformers can meta-learn to explore—learning both when and how to query—ICPE opens a new research direction connecting in-context learning with active learning and sequential testing.
(1) It remains unclear how ICPE scales to unseen environment distributions or how its meta-training distribution affects performance. (2) The experimental suite, though diverse, primarily focuses on low-dimensional or discrete problems. These settings validate proof-of-concept behavior but do not test ICPE’s limits in large or continuous hypothesis spaces. (3) Many experiments rely on synthetic or well-defined priors over tasks, where the environment distribution and hypotheses are known. In mor
- The paper is written very well in my opinion. I am not deeply familiar with the pure exploration problem but I still feel like the paper was accessible and did a good job of conveying necessary background and its own technical contribution. - The experiments have a good breadth of domains covered, including both toy and real-world inspired. Across these domains, the proposed method improves the key performance criteria compared to baselines. - I find the approach of meta-learning both an infer
- The empirical study only considers 3-5 seeds per baseline run. This seems far too few to understand the true spread of results. Confidence intervals are computed with hierarchical bootstrapping, but no explanation is given for why this method was chosen. - The importance of the theoretical results in this paper is unclear.
1. Using a transformer-based approach to solve the pure exploration problem is novel and interesting. 2. The approach is theoretically sound. For both the fixed confidence and fixed budget settings, the paper defines an MDP structure with corresponding reward functions. The paper shows that achieving the pure exploration objective is equivalent to finding an optimal policy for the corresponding MDP problem. 3. Compared with existing approaches, the proposed in-context pure exploration does not
1. My main concern lies in the setup of the training set, as the experimental section does not disclose these details. In general, a core challenge in online learning is that data arrive sequentially. If the method requires an extensive training set, along with prior knowledge of $H^*$ on that training set, and assumes that for each chosen $a_t$ during the training process, the corresponding $x_{t+1}$ can be observed, this may significantly limit its applicability in real-world scenarios. I hope
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
