Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

Yimeng Zhang; Tian Wang; Jiri Gesi; Ziyi Wang; Yuxuan Lu; Jiacheng Lin; Sinong Zhan; Vianne Gao; Ruochen Jiao; Junze Liu; Kun Qian; Yuxin Tang; Ran Xue; Houyu Zhang; Qingjun Cui; Yufan Guo; Dakuo Wang

arXiv:2507.17842·cs.CL·February 24, 2026

Open Access · 3 Reviews

TL;DR

Shop-R1 introduces a reinforcement learning framework that enhances large language models' ability to simulate human online shopping behavior by decomposing the task into rationale generation and action prediction, guided by distinct reward signals.

Contribution

This work presents a novel RL approach that improves LLM reasoning and behavior simulation in online shopping, surpassing previous supervised fine-tuning methods.

Findings

01

Achieves over 65% relative improvement in exact-match accuracy over the SFT baseline.

02

Effectively guides LLM reasoning with internal model signals.

03

Hierarchical reward structure prevents reward hacking.

Abstract

Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The task is unique, useful, and well motivated, and the approach is a useful step for scalable UX evaluation, A/B testing simulators, and recommendation research.
- Introduces a simulation-oriented RL design for human-behavior replay rather than task completion, with a two-head reward structure (rationale + action).
- The hierarchical + DARS scheme is a pragmatic way to densify rewards and disincentivize “terminate”-spamming. In addition, the reward design is well motivated. The

Weaknesses

- All results rely on a proprietary single-site dataset; there’s no evaluation on public benchmarks or of cross-dataset or cross-site generalization.
- Self-certainty reward risks overconfidence. Using KL-to-uniform for rationales may incentivize confident but incorrect chains of thought; calibration is not reported.
- Reward design sensitivity. DARS is fixed at 1000, but sensitivity curves and thresholding for ROUGE-L (the text says “e.g., 0.75”, p. 5) are not systematically explored.
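The two quantities this reviewer questions can be illustrated with a minimal sketch. It assumes self-certainty is the mean per-token KL(U || p) between a uniform distribution and the model's next-token distribution, and that the ROUGE-L F-score is used as a hard gate at 0.75. The KL direction, the whitespace tokenization, and the function names (`self_certainty`, `gated_text_reward`) are assumptions made for illustration, not details confirmed by the paper.

```python
import math

EPS = 1e-12  # guards against log(0); an illustrative choice


def kl_uniform_to_p(probs):
    """KL(U || p) for one next-token distribution.

    Equals 0 when p is uniform and grows as p becomes more peaked.
    """
    v = len(probs)
    u = 1.0 / v
    return sum(u * math.log(u / max(p, EPS)) for p in probs)


def self_certainty(token_dists):
    """Mean KL(U || p) over all generated tokens (assumed definition)."""
    return sum(kl_uniform_to_p(p) for p in token_dists) / len(token_dists)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(pred, ref):
    """ROUGE-L F-score on whitespace tokens."""
    p_tok, r_tok = pred.split(), ref.split()
    lcs = lcs_length(p_tok, r_tok)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p_tok), lcs / len(r_tok)
    return 2 * precision * recall / (precision + recall)


def gated_text_reward(pred, ref, threshold=0.75):
    """Hypothetical gate: pay the score only at or above the threshold."""
    score = rouge_l_f1(pred, ref)
    return score if score >= threshold else 0.0


if __name__ == "__main__":
    peaked = [[0.97, 0.01, 0.01, 0.01]]
    flat = [[0.4, 0.3, 0.2, 0.1]]
    print(self_certainty(peaked), self_certainty(flat))
    print(gated_text_reward("red shoes", "red running shoes"))
    print(gated_text_reward("red", "red running shoes"))
```

Under these assumptions a peaked distribution earns a higher self-certainty score whether or not the predicted token is correct, which is the overconfidence risk raised above. The hard gate also shows why the threshold matters: predictions just below 0.75 earn nothing, so a sensitivity sweep over the threshold would be informative.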

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. The work tackles the important and practical challenge of simulating nuanced, human-like behavior in web environments, moving beyond simple task completion.
2. The paper introduces a sophisticated, hierarchical reward scheme with difficulty-aware scaling that effectively mitigates common reward-hacking failure modes.
3. The empirical results are strong, demonstrating a significant 65% relative improvement in exact-match accuracy over the SFT baseline.

Weaknesses

1. Limited Novelty: The core training paradigm, which combines SFT with subsequent RL optimization, is a common practice in the field. The contribution is thus an incremental (though effective) application of established methods rather than a fundamental advancement.
2. The proposed hierarchical reward function is highly complex and task-specific, with numerous hard-coded weights and rules. This intricate, hand-crafted design is brittle and would be difficult to generalize to new simulation e

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The idea of leveraging LLMs to simulate human decision-making in online shopping is interesting and promising.
2. The methodological presentation is generally clear, particularly the explanation of the reward design.

Weaknesses

My main concern is that the task definition is unclear, making it difficult to assess the task's difficulty or determine whether the benchmark genuinely reflects real-world challenges. In addition, the experimental results do not provide strong evidence supporting the effectiveness of the proposed method. Specifically:

1. Datasets:
   - The paper lacks sufficient detail on how the datasets were constructed, and provides no illustrative examples, making it difficult to assess task difficulty or d

Taxonomy

Topics: Transportation and Mobility Innovations