Can foundation models actively gather information in interactive environments to test hypotheses?
Danny P. Sawyer, Nan Rosemary Ke, Hubert Soyer, Martin Engelcke, David P Reichert, Drew A. Hudson, John Reid, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Mozer, Jane X Wang

TL;DR
This paper evaluates foundation models' ability to gather information and adapt in interactive, multi-trial environments, revealing that prompting for summaries fosters emergent meta-learning and highlighting the importance of adaptive knowledge integration.
Contribution
It introduces a new benchmark environment for testing meta-learning in foundation models and demonstrates that prompting strategies can enable emergent adaptive behaviors.
Findings
Models perform well in simple information gathering tasks.
Prompting for summaries enables meta-learning and adaptation.
Alchemy benchmark reveals robustness differences among models.
Abstract
Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments, a requirement for many real-world challenges. We evaluated these models on their ability to learn from experience, adapt, and gather information. First, in "Feature World," a simple setting for testing information gathering, models performed near-optimally. However, to test more complex, multi-trial learning, we implemented a text-based version of the "Alchemy" environment, a benchmark for meta-learning. Here, agents must deduce a latent causal structure by integrating information across many trials. In this setting, recent foundation models initially failed to improve their performance over time. Crucially, we found that prompting the models to summarize their observations at regular intervals enabled an emergent meta-learning process. This allowed them to improve across…
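The abstract's central intervention, prompting the model to summarize its observations at regular intervals, can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_trials_with_summaries`, `echo_model`, the prompt wording, and the `summary_every` interval are all hypothetical, and the "model" is a stub standing in for a real foundation-model call.

```python
from typing import Callable, List, Tuple


def run_trials_with_summaries(
    model: Callable[[str], str],
    observations: List[str],
    summary_every: int = 3,
) -> Tuple[List[str], str]:
    """Ask `model` for a running summary after every `summary_every` observations.

    Hypothetical sketch: prompt wording and interval are assumptions,
    not taken from the paper.
    """
    summary = ""              # running summary carried across trials
    recent: List[str] = []    # observations seen since the last summary
    prompts: List[str] = []   # every summary prompt that was sent

    for i, obs in enumerate(observations, start=1):
        recent.append(obs)
        if i % summary_every == 0:
            prompt = (
                f"Previous summary: {summary or '(none)'}\n"
                "New observations:\n"
                + "\n".join(recent)
                + "\nSummarize what you have learned so far."
            )
            prompts.append(prompt)
            summary = model(prompt)  # the new summary replaces the old one
            recent = []              # start collecting the next batch

    return prompts, summary


def echo_model(prompt: str) -> str:
    """Stub model (assumption): reports how many lines the prompt contained."""
    return f"seen {len(prompt.splitlines())} lines"


if __name__ == "__main__":
    obs = [f"trial {k}: outcome {k % 2}" for k in range(1, 8)]
    sent, final_summary = run_trials_with_summaries(echo_model, obs)
    print(len(sent), final_summary)
```

The design point is that each summary prompt carries the previous summary forward, so information can accumulate across trials without replaying the full interaction history. Observations that arrive after the last full interval are left unsummarized in this sketch.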
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The authors attempt to explore a novel aspect of foundation models, exploration in interactive environments, which is a unique study in its own right.
- The baselines (upper/lower bounds) are logically sound. I particularly like the idea of an optimal baseline to know where the most efficient agent would lie in the spectrum.
- The authors perform logical extensions (self-correction, long-context) and ablations (e.g. with correct visual outputs in the 3D environments) which help understand how well
- The evaluation is limited: for some reason the authors only consider the Gemini-based agent, and not any other LLMs. Curious to hear an explanation of why this was done. What about the results on other LLM families?
- The results in Figure 3 are not unexpected. I would expect a larger LLM to be more efficient on account of more training/generalization ability, and LLMs to be worse than the optimal baseline and better than random agents. I think there should be some more exploration towards
This is a very well-written paper with great figures (not a common occurrence in ICLR papers). Presentation is perfect. I enjoyed reading this paper. Simple and well-designed experimental task, which will be easy for a broad audience to understand.
Evaluation is focused on Gemini. I feel like the paper is very well presented, but in terms of the research questions it illuminates, it is going after low-hanging fruit. It is super easy to implement a hypothesis testing task in a text prompt, and to compare to random/optimal information-seeking baselines. The main contribution of this paper seems to come from staging the task in a visual 3D environment; however, this staging does not tell us a lot about whether and how LLMs explore.
- The paper is well-motivated, with a clear goal of studying active information gathering in large language models.
- The task is simple enough for addressing the scientific questions the authors are trying to ask, with a minimum amount of confounding factors.
- The paper is well-written and easy to read.
- Evaluations: the evaluations are insufficient in several ways. First, although the paper studies a human-like learning problem, no human baseline is presented. For example, would humans reach a near-optimal policy? Or are they more like the Gemini model tested? Second, the title says foundation "models". However, only the Gemini 1.5 model was tested. How do other models (Claude and GPTs) perform?
- Relations to prior works: the setting the authors introduced is not new. I think severa
Generally, the paper makes a strong case for the importance of information-gathering capabilities in foundation models and contributes valuable knowledge that can inform the development and application of AI systems. There are some strengths of the paper:
- The selected topic the researchers focus on seems interesting. This framework allows for the evaluation of models' ability to strategically gather and reason about information in a systematic way.
- The implementation of the framework in bo
- The study primarily focuses on the Gemini 1.5 model, which may not fully represent the capabilities and behaviors of other foundation models. As a benchmark, evaluating a wider range of models could provide a more comprehensive understanding. This constraint limits the applicability and generalization of the study's findings to other models.
- From my point of view, the assessment of pure LLMs' strategic information-gathering abilities appears less meaningful (e.g., compared with RL agents) d
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Data Visualization and Analytics
