Benchmarking In-context Experiential Learning Through Repeated Product Recommendations
Gilbert Yang, Yaqin Chen, Thomson Yen, Hongseok Namkoong

TL;DR
This paper introduces BELA, a benchmark for evaluating in-context experiential learning in product recommendation tasks, highlighting current models' limitations in adaptive learning through dialogue.
Contribution
The paper presents a new benchmark, BELA, combining real-world data, diverse personas, and a simulated environment to assess experiential learning in language models.
Findings
Current models show limited improvement across episodes.
BELA enables evaluation of adaptive learning in dialogue-based recommendation.
The results highlight the need for models with stronger in-context learning capabilities.
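As a rough illustration of what "limited improvement across episodes" could mean quantitatively, the sketch below computes a per-episode regret and fits a trend line to it. This is a minimal sketch, not the paper's evaluation code: the definition of regret used here (best available score in the choice set minus the score of the recommended product), the function names `episode_regret` and `regret_trend`, and the toy scores are all assumptions made for illustration.

```python
from statistics import mean


def episode_regret(scores, chosen):
    """Regret for one episode (assumed definition):
    best available score in the choice set minus the chosen item's score."""
    return max(scores.values()) - scores[chosen]


def regret_trend(regrets):
    """Least-squares slope of regret against episode index.
    A clearly negative slope would suggest the agent improves with experience;
    a slope near zero would suggest it does not."""
    n = len(regrets)
    if n < 2:
        raise ValueError("need at least two episodes to fit a trend")
    x_bar = (n - 1) / 2
    y_bar = mean(regrets)
    num = sum((i - x_bar) * (r - y_bar) for i, r in enumerate(regrets))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


# Hypothetical data: (persona's latent score per product, product recommended).
episodes = [
    ({"a": 4.0, "b": 2.0, "c": 5.0}, "b"),
    ({"a": 3.0, "b": 5.0, "c": 1.0}, "a"),
    ({"a": 5.0, "b": 4.0, "c": 2.0}, "b"),
]

regrets = [episode_regret(scores, chosen) for scores, chosen in episodes]
slope = regret_trend(regrets)
print(regrets, slope)
```

In this toy run regret shrinks by one point per episode, so the slope is negative. The finding reported above corresponds to the opposite case, where the trend stays close to flat.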
Abstract
To reliably navigate ever-shifting real-world environments, agents must grapple with incomplete knowledge and adapt their behavior through experience. However, current evaluations largely focus on tasks that leave no ambiguity, and do not measure agents' ability to adaptively learn and reason through the experiences they have accrued. We exemplify the need for this in-context experiential learning in a product recommendation context, where agents must navigate shifting customer preferences and product landscapes through natural language dialogue. We curate a benchmark for experiential learning and active exploration (BELA) that combines (1) rich real-world products from Amazon, (2) a diverse collection of user personas to represent heterogeneous yet latent preferences, and (3) an LLM user simulator powered by the persona to create rich interactive trajectories. We observe that current…
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper addresses an important problem, which is the ability to conduct personalized LLM research in interactive settings.
- The benchmark is grounded in real Amazon product data.
- There is a large number of different "people" and products to evaluate an LLM on.
- The environment is interactive and dynamic, so the proxy humans are impacted by the decisions of the learning agent.
- Based on the presented experiments, the problem needs active work as LLMs don't inherently have the ability to i
- The main weakness of this paper is that the personas and the LLM-as-a-Judge scoring approach are not well validated.
- The paper the personas come from was used to identify limitations with current approaches to creating personas by identifying issues and biases with the set of 1M personas introduced, which are used in this paper. These personas were found to be biased, which means they are not overly unique, limiting the impact of having 1M personas.
- The only measure of the LL
Strengths:
- This paper presents a novel task design for LLM study. The focus on learning across episodes is a great plus, as this is very important for practical deployment. However, this is largely unexplored.
- The dataset size is also large, including ~71K products, ~2K choice sets, and ~1M personas, which can enable scalable evaluation across diverse domains.
- This paper also provides a multi-model evaluation showing no meaningful improvement over episodes and poor uncertainty calibration
Weakness:
- My biggest concern is that I could not find a user study or evaluation with real human preferences to validate the benchmark realism. I think this is critical to ground the benchmark in real human studies.
- It'd be great if the authors could discuss how the choice of regret, stars, and text feedback can impact the results.
- It'd be helpful to have pure human baselines. I understand that this can be costly, but at least mentioning this in the future work can be helpful to inform future
**Timely discussion about the experiential learning problem in the recommendation scenarios.** The paper highlights a crucial yet overlooked problem for current LLMs: in-context experiential learning. Moreover, this paper discusses experiential learning under the recommendation scenarios, which is practical and reasonable.
**Comprehensive experimental setup.** BIEL provides a large-scale, systematically generated environment, spanning multiple user personas and product domains, allowing broad
**Lack of discussion on existing conversational recommender systems.** A similar and standard research area in recommender systems is the conversational recommender system. However, this paper lacks the necessary discussion about this research area. The multi-turn recommendation paradigm is actually a common setting in conversational recommender systems, requiring discussion [1].
**Lack of sufficient analysis about the poor performance of existing advanced LLMs.** Existing works demonstrate tha
Taxonomy
Topics: Persona Design and Applications · AI in Service Interactions · Social Robot Interaction and HRI
