Meta-RL Induces Exploration in Language Agents

Yulun Jiang; Liangze Jiang; Damien Teney; Michael Moor; Maria Brbic

arXiv:2512.16848·cs.LG·March 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces LaMer, a Meta-RL framework that enhances language agents' exploration and adaptation capabilities, leading to significant performance improvements and better generalization in complex, unseen tasks.

Contribution

LaMer is a novel Meta-RL approach that enables language agents to actively explore and adapt without gradient updates, improving robustness and performance.

Findings

1. 11-19% performance gains over RL baselines on benchmark tasks
2. Better generalization to unseen environments
3. Effective in diverse multi-turn tasks

Abstract

Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19%…
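The two components named in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the guessing task, the `toy_policy` and `reflect` stand-ins (a real agent would be an LLM producing textual reflections), and the cross-episode discount `gamma`. The sketch only shows the general shape: a policy that conditions on reflections from earlier episodes with no parameter updates, and a return that credits an early exploratory episode for success in later ones.

```python
import random


def cross_episode_returns(episode_rewards, gamma=0.9):
    """Discounted return-to-go across episodes: G_n = r_n + gamma * G_{n+1}.

    The cross-episode discount is an assumption of this sketch.
    """
    returns = [0.0] * len(episode_rewards)
    running = 0.0
    for n in reversed(range(len(episode_rewards))):
        running = episode_rewards[n] + gamma * running
        returns[n] = running
    return returns


def reflect(arm, success):
    """Hypothetical stand-in for an LLM-written reflection on one episode."""
    outcome = "worked" if success else "failed"
    return {"arm": arm, "success": success, "text": f"arm {arm} {outcome}"}


def toy_policy(memory, n_arms, rng):
    """Hypothetical stand-in for the agent: adapts purely from in-context memory."""
    solved = [m["arm"] for m in memory if m["success"]]
    if solved:
        return solved[0]  # exploit what an earlier episode discovered
    ruled_out = {m["arm"] for m in memory if not m["success"]}
    candidates = [a for a in range(n_arms) if a not in ruled_out]
    return rng.choice(candidates)  # explore only untried options


def run_trial(target, n_arms=4, n_episodes=4, seed=0):
    """Run several episodes on the same task, carrying reflections forward."""
    rng = random.Random(seed)
    memory, rewards = [], []
    for _ in range(n_episodes):
        arm = toy_policy(memory, n_arms, rng)
        success = arm == target
        rewards.append(1.0 if success else 0.0)
        memory.append(reflect(arm, success))  # no gradient update, only context
    return rewards, memory


if __name__ == "__main__":
    rewards, memory = run_trial(target=2)
    print("per-episode rewards:", rewards)
    print("cross-episode returns:", cross_episode_returns(rewards))
```

In this toy setting a failed first episode still receives a positive cross-episode return whenever a later episode succeeds, which is one way a training signal could favor informative exploration over greedy single-episode behavior.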

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The paper is well motivated and clearly explains how the authors tackle the problem of exploration during RL. While the Meta-RL framework itself is not novel, its application to LLMs is. The paper shows significant gains over single-episode training across multiple benchmarks. It also shows better out-of-distribution generalization than non-meta-learning training on unseen benchmarks.

Weaknesses

See the "questions" section. Beyond this, I would be interested in a comparison against pass@k metrics for meta-exploration that have previously been explored in RL (for example, see Walder et al., "Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems"). (I do not think this is necessary for this to be a good paper! Just a suggestion for extension.)
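For readers unfamiliar with the metric the reviewer mentions, pass@k is the probability that at least one of k sampled attempts succeeds. A minimal sketch of the commonly used unbiased estimator follows, assuming n independent attempts per task with c successes; the function name and argument checks are illustrative and are not taken from the paper under review or from the cited work.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled attempts, c of which succeeded.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n attempts contains no success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a success
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 4 attempts, 1 success: half of the size-2 subsets contain the success.
    print(pass_at_k(n=4, c=1, k=2))
```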

Reviewer 02 · Rating 6 · Confidence 5

Strengths

- The paper is well written and easy to follow, with clearly stated hypotheses and a clean experimental design. The idea is simple and promising for helping LLM agents explore in RL tasks.
- It presents strong empirical results showing that meta-RL with reflection substantially improves the performance of LLM agents across multiple game environments.

Weaknesses

- The paper does not analyze the reflections generated by the LLM. Do they make sense? Do they evolve over time? Is performance improvement mainly driven by changes in the policy or in the reflections themselves?
- All experiments are conducted with a single LLM (Qwen3-4B). It would be

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The idea is clear and well motivated. It’s conceptually simple and general. The evaluation is broad and the study of generalization is interesting. The paper is clear and well-written.

Weaknesses

**What is actually learned?** It’s not clear to me whether the agent is conditioned only on the reflection from the previous episode or on the previous history too. If conditioned on both, it would be good to ablate the reflection mechanism and see how performance holds. Ablating history and only keeping reflections might be interesting too. Is the feedback generation capacity trained too? Or just leveraging a frozen model? Maybe the approach doesn’t actually train exploratory behaviors but

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling