Failing to Explore: Language Models on Interactive Tasks

Mahdi JafariRaviz; Keivan Rezaei; Arshia Soltani Moakhar; Zahra Sodagar; Yize Cheng; Soheil Feizi

arXiv:2601.22345·cs.LG·February 2, 2026

TL;DR

This paper evaluates how well language models explore interactive environments with limited interactions, revealing systematic under-exploration and proposing interventions to improve exploration efficiency.

Contribution

It introduces three controllable exploration tasks and studies lightweight interventions that enhance exploration performance of language models.

Findings

01 Models under-explore and often perform significantly worse than simple explore-exploit heuristic baselines.

02 Splitting a fixed interaction budget into parallel runs improves performance.

03 Periodically summarizing the interaction history preserves key discoveries and further improves exploration.

Abstract

We evaluate language models on their ability to explore interactive environments under a limited interaction budget. We introduce three parametric tasks with controllable exploration difficulty, spanning continuous and discrete environments. Across state-of-the-art models, we find systematic under-exploration and suboptimal solutions, with performance often significantly worse than simple explore-exploit heuristic baselines and scaling weakly as the budget increases. Finally, we study two lightweight interventions: splitting a fixed budget into parallel executions, which surprisingly improves performance despite a no-gain theoretical result for our tasks, and periodically summarizing the interaction history, which preserves key discoveries and further improves exploration.
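
To make the budgeted setting concrete, here is a minimal sketch of what an explore-exploit heuristic and a budget split could look like. It is not the paper's baseline or any of its three tasks, which this page does not describe. The 1-D objective, the function names `explore_then_exploit` and `split_budget`, and the parameters `explore_frac` and `step` are all assumptions made for illustration.

```python
import random
from typing import Callable, Tuple


def explore_then_exploit(
    f: Callable[[float], float],
    budget: int,
    explore_frac: float = 0.5,
    lo: float = 0.0,
    hi: float = 1.0,
    step: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Hypothetical heuristic: random queries first, then local refinement.

    Every call to f counts as one interaction; exactly `budget` calls are made.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    rng = random.Random(seed)
    n_explore = max(1, int(budget * explore_frac))
    best_x, best_y = lo, float("-inf")

    # Exploration phase: uniform random queries over the whole interval.
    for _ in range(n_explore):
        x = rng.uniform(lo, hi)
        y = f(x)
        if y > best_y:
            best_x, best_y = x, y

    # Exploitation phase: small perturbations around the best point so far.
    for _ in range(budget - n_explore):
        x = min(hi, max(lo, best_x + rng.uniform(-step, step)))
        y = f(x)
        if y > best_y:
            best_x, best_y = x, y

    return best_x, best_y


def split_budget(
    f: Callable[[float], float], budget: int, n_runs: int
) -> Tuple[float, float]:
    """Hypothetical split: n_runs independent runs of budget // n_runs each.

    Each run uses a different seed; the best result across runs is returned.
    """
    per_run = budget // n_runs
    results = [explore_then_exploit(f, per_run, seed=s) for s in range(n_runs)]
    return max(results, key=lambda r: r[1])
```

For example, `explore_then_exploit(lambda x: -(x - 0.7) ** 2, budget=40)` spends 20 queries exploring and 20 refining, while `split_budget(..., budget=40, n_runs=4)` spends 10 queries in each of four independent runs. Whether the split helps depends on the task and the agent; the sketch only illustrates the bookkeeping, not the paper's result.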

Code & Models

Datasets

AghaTizi/explore-exploit-bench
dataset · 51 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Games