Provably Learning from Language Feedback

Wanqiao Xu; Allen Nie; Ruijie Zheng; Aditya Modi; Adith Swaminathan; Ching-An Cheng

arXiv:2506.10341·cs.LG·June 13, 2025

Open Access · 3 Reviews

TL;DR

This paper formalizes the problem of learning from language feedback, introduces a complexity measure called transfer eluder dimension, and presents a provably effective algorithm, HELiX, that outperforms naive prompting in various domains.

Contribution

It provides a theoretical framework for learning from language feedback, introduces transfer eluder dimension, and develops a no-regret algorithm with proven guarantees.

Findings

01

Learning from rich language feedback can be exponentially faster than from reward.

02

The HELiX algorithm achieves performance guarantees that scale with the transfer eluder dimension.

03

Empirical results show HELiX outperforms simple prompting methods.
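
To make the first finding concrete, here is a toy sketch (not taken from the paper) of a number-guessing game. It assumes a secret integer in `range(n)` and compares a learner that only sees a binary reward (right or wrong) with one that also receives a directional hint, which stands in for richer language feedback. The function names and the game itself are hypothetical illustrations; the paper's own analysis is stated in terms of transfer eluder dimension.

```python
def guesses_with_reward_only(secret: int, n: int) -> int:
    """Binary reward only: each guess reveals just right/wrong, so the
    learner can do no better than enumerating candidates."""
    for count, guess in enumerate(range(n), start=1):
        if guess == secret:
            return count
    raise ValueError("secret must lie in range(n)")


def guesses_with_directional_feedback(secret: int, n: int) -> int:
    """Richer feedback: each wrong guess is answered with 'higher' or
    'lower', so the learner can halve the candidate set every round."""
    lo, hi = 0, n - 1
    count = 0
    while lo <= hi:
        guess = (lo + hi) // 2
        count += 1
        if guess == secret:
            return count
        if guess < secret:  # feedback: "higher"
            lo = guess + 1
        else:               # feedback: "lower"
            hi = guess - 1
    raise ValueError("secret must lie in range(n)")


n = 1024
worst_reward = max(guesses_with_reward_only(s, n) for s in range(n))
worst_feedback = max(guesses_with_directional_feedback(s, n) for s in range(n))
print(worst_reward, worst_feedback)
```

In this toy setting the worst case grows linearly in `n` with reward alone and logarithmically with the hint, which is the kind of exponential gap the finding refers to.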

Abstract

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce transfer eluder dimension as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called HELiX, that provably…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The paper provides a nice conceptual framework to understand the benefit of learning from language feedback, which is relevant in modern applications where LLMs are good at providing language feedback beyond rewards.
- A general statistical measure, transfer eluder dimension, is proposed that can handle general hypothesis classes.
- The initial set of theoretical results is somewhat complete, in the sense that it compares with reward-based feedback, as well as showing when it can be signific

Weaknesses

- Assumption 3 seems important, although maybe necessary for clear development of theoretical results. Do you have a sense if this is satisfied in your experiments?
- (Clarity) Can the authors clarify what are some roadblocks in proving a $\sqrt{T}$ regret bound for HELiX under the general feedback setting (beyond the square loss assumption)?
- I agree with the final remark by the authors that there are some LLF problems that have infinite transfer eluder dimension yet are trivially solvable. I wonder

Reviewer 02 · Rating 2 · Confidence 4

Strengths

Language feedback is a rich additional source of information and an under-explored topic, as language models and tasks have only recently enabled progress in this area. In a research landscape where many language tasks and methods are ill-defined and do not admit theoretical statements, I am generally supportive of more formalization in the LLF direction so more precise mathematical statements can be made about improvement and learning in language spaces. The paper has a nice arc

Weaknesses

Despite being generally positive about formalizing this space, the experimental results are holding me back from giving an acceptance at this point. I am very open to discussing this. Before getting into the specific formal setting of the paper, Wordle, Battleship, and Minesweeper seem like well-studied tasks where other methods and experimental settings have been investigated. However, the results in Figure 2 are isolated from these, and not compared to what I would consider state-of-the-art A

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Important work to better understand in-context learning of LLMs in one-step RL problems. This is a continuously growing field, with many works proposing to use LLM agents in a loop to achieve complex tasks, often greatly lacking formalization. This theoretical work proposes a significant step toward closing that gap.
- Theoretical assumptions and demonstrations look reasonable and sound (I did not check all the proofs though).
- Authors made an important effort for illustrating

Weaknesses

- Pedagogy and Positioning: While it is quite well written overall, I had a bit of difficulty fully grasping this paper. First, I believe the positioning should make it clearer that the work is set in a one-step RL (bandit) framework. It took me some time to realize that the paper does not consider the more common multi-step RL setting that I am more familiar with. I appreciate the authors’ effort to provide illustrative examples for the various abstract concepts; these are genuinely helpful

Taxonomy

Topics: Machine Learning and Algorithms · Topic Modeling · Natural Language Processing Techniques